============== Data analytics ============== .. image:: images/Data_Analytics.png :class: with-shadow :scale: 90 Click on the Data analytics icon, which will redirect you into the project creation screen. This will take the users into the screen where the project can be created. .. figure:: images/data_analysis.png :class: with-shadow :scale: 100 *Data Analysis* * A - For project creation, click on “Create project” which will redirect you to your data sources (screen number 4) * B - Here you can see a list of all your current projects * C - By clicking on the tick box next to your project, you can select the project * D - Delete the selected project .. image:: images/projects.png :class: with-shadow :scale: 70 Other possible actions with the selected project: * E - Load the data to perform analysis * F - Delete a file from the project * G - Download the data into the local system as CSV file Data sources ============ Connecting to the file system ------------------------------- .. figure:: images/data_sources.png :class: with-shadow :scale: 70 *Data Sources- Screen number 4.1* * A - For uploading a file from your computer, select file upload * B - Select file that you are planning to upload * C - Select the file type - CSV, JSON, Parquet, Avro, S3 or HDFS .. figure:: images/file_upload.png :class: with-shadow :scale: 70 *file upload* * D - Select your delimiters – tab, Semicolon, Comma, Space, Other * E - After completing previous steps click on Add files Connecting to Database ------------------------------- .. figure:: images/conn_database.png :class: with-shadow :scale: 70 *Data Sources- Screen number 4.2* * A - For uploading a file from a database, select DB connection * B - If you wish to upload a new file, select New Connection tab * C - If you already have a connection, select Existing Connection tab .. figure:: images/exist_dbconnection.png :class: with-shadow :scale: 70 * D - Within new connection, select the database type from the following options .. figure:: images/connection.png :class: with-shadow :scale: 70 * E - Fill in your username and password DB Schemas and Access ===================== This section is about the data upload into SEDGE from different types of database. SEDGE has the functionality of connecting to different database such as MYSQL, Mariadb, Oracle and Postgresql. The future release of SEDGE will also cover other database such as Cassandra, SQLite, Presto, Redshift and Redis. If you need assistance in connecting to these DB, talk to our support team, and they can provide functionality to connect to these databases also Accessing Database ----------------------- .. figure:: images/DB_Schemas_and_Access.png :class: with-shadow :scale: 70 *DB Schemas and Access* * A - For uploading a file from a database, select DB connection * B - If you wish to upload new file, select New Connection tab * C - If you already have a connection, select Existing Connection tab Connecting to DB -------------------- .. figure:: images/existing_connection_blank.png :class: with-shadow :scale: 70 *existing connection 1* * D - Within new connection, select the database type from the following options, we are demonstrating our example on the MYSQL database .. figure:: images/Database_type.png :class: with-shadow :scale: 70 *database type* * E - Fill in the class name of your database * F - Fill in the host name assigned to your database * G - Fill in the port number * H - Fill in the username * I - Fill in the password * J - Proceed with the connect button -------------------- Selecting existing DB Connections ------------------------------------ .. figure:: images/save_connection.png :class: with-shadow :scale: 70 *save connection* After filling in the above table a new window appears * A - Fill in the Database and Connection Name * B - Save connection (this will then appear in the Existing Connections tab) * C - Or proceed directly to SQL Editor .. figure:: images/existing_connections.png :class: with-shadow :scale: 70 *existing connections 2* Your project was successfully uploaded and is among the Existing Connections * A - You can open the project by clicking on the Connect button and you will be redirected to the SQL editor * B - You can erase the project by clicking on the Bin button -------------------- Working with SQL editor ------------------------------- .. figure:: images/SQL_editor.png :class: with-shadow :scale: 70 *SQL editor* * A - In the SQL editor your datasets will appear under the Entities box, for our demonstration we have uploaded three medical related datasets * B - In the SQL Query generator box, you can write your commands, for example when you wish to open one of your datasets, or do joins of the tables * C - After finishing your query, click on the Execute button to proceed to the preview -------------------- Writing SQL query in Editor ------------------------------- .. figure:: images/SQL_editor_query_1.png :class: with-shadow :scale: 70 *SQL editor query* * A - After clicking on one of the datasets, the query generator automatically selects this file * B - After selecting Execute button, we can see the preview of the data * C - Data preview (top 100 rows) * D - If we want to proceed with this dataset, we continue by clicking on the Save button Joining multiple tables in SQL query Editor --------------------------------------------- .. figure:: images/SQL_join.png :class: with-shadow :scale: 70 *SQL join* *SQL Join query example* We can use the SQL Query Generator in many ways, for example to combine records from two or more tables. There are 4 types of joins – Inner, Full, Left or Right Join. In our example, we are performing a left join – putting together records from the first (left-most) table with matching right table records. As a result, we get a left join with table “cancer diagnostic” and “diabetes” using as a condition patient ID. This means we will get one final table, that will combine the information on the respective patient with same ID. ETL Process ============= .. image:: images/SEDGE.png :alt: SEDGE Logo :target: https://sedgeanalysis.solverminds.net/ :class: with-shadow :scale: 100 File upload ============= .. figure:: images/file_upload2.png :class: with-shadow :scale: 70 *upload file - Screen no 5* * A - After adding your files, which might take a while, depending on the file size, click on the upload button, which will redirect you to the data preview page. * B - Users also have the option of zipping the CSV file, as large size CSV file can be uploaded. * C - Users can enter the name of the project (optional). If the users have not entered the name of the file then the system will take the name of the file and set a default project number. Multiple upload --------------------- It is also possible to upload multiple files simultaneously. SEDGE allows to upload multiple files in one upload form. .. figure:: images/Multipleuploads.png :class: with-shadow :scale: 70 *Multipleupload* File collaboration ==================== File collaboration in SEDGE facilitates the sending and receiving of files and collaboration among users. This functionality helps in collaborating and exchanging opinions with the other users in SEDGE. This offers a wide array of advantages, especially among users from different geographic locations and diverse timelines. It only takes a single click to send and enable the file access to the user. .. figure:: images/Filecollab.png :class: with-shadow :scale: 70 *Filecollaboration* Data preview ============= .. figure:: images/data_preview.png :class: with-shadow :scale: 70 *Data Preview - Screen no 6* For our demonstration, we have chosen the “Titanic” data. In the future steps, we will be forecasting, based on the available data, what was the chance of survival on board of Titanic. In this case, the factors taken into consideration are: class, name, sex, age, number of family members, parch, ticket number, fare, cabin, and point of embarkment. Some of these factors might not be relevant for the predictive analysis, so we will also demonstrate how to erase or ignore some of the data. .. note:: This page is just a preview of the full data, only a small sample is displayed over here to get the first image. * A - This row displays all the variables in the current analysis. You can sort the data by clicking on the arrows next to the respective variable * B - This row identifies if the data in a column are categorical or numerical (C or N) * C - Monitor in which phase of the analysis you are currently located – identified by the green colour * D - Information about the total number of rows and columns in the uploaded file * E - Action buttons to move either to the next step or return to the previous step Profile Dataset ============================== .. figure:: images/profile_dataset.png :class: with-shadow :scale: 70 *Profile Dataset - Screen no 7* * A - Dataset info – basic information about the respective data set like number or variables, number of observations/rows, percentage of missing data, warnings * B - List of all columns/variables in the data set * C - Information if the variable is categorical or numerical * D - How many unique values are there within each variable (sex – male or female – 2 possible values, name – unlimited number of values) * E - Total number of missing values within the respective variable * F - Number of missing values represented in percentage * G - Mean – the average value of the respective numerical variable. Found by adding all numbers in the data set and then dividing by the total number of variables in the data set * H - Median – the middle value in the data set – value that separates the higher half of the data sample from the lower half * I - SD – standard deviation represents how close or far the values are dispersed from the mean * J - In this tab, you have to select the target you are going to be predicting. It is one of the variables that are already part of the dataset. In our case, we are going to be predicting if the Titanic passenger did or did not survive .. figure:: images/feature_list.png :class: with-shadow :scale: 70 *feature list* * K - Sampling of data – you can base your research only on a limited part of the dataset (for example when the dataset is extremely large). * If you wish to sample your data, after clicking yes, you have to select the type * In the next step select the size type – percentage, absolute value or by the calculator .. figure:: images/sampling.png :class: with-shadow :scale: 70 *sampling* * Select the percentage of the sample you are going to base your analytics on .. figure:: images/sampling_percent.png :class: with-shadow :scale: 70 *sampling percent* * Select the variable .. figure:: images/variable_selection.png :class: with-shadow :scale: 70 *sampling percent* * After selecting your preferences click on Apply button * L - GDPR - Data pseudonymization * Depending on the nature of your data, you can choose if you wish to protect personal data or no. If you select yes, next step will take you to the GDPR page where you will select which data you want to hide. If you do not need to hide any of the data, select NO button and the GDPR page will be skipped .. figure:: images/data_protection.png :class: with-shadow :scale: 70 *data protection* .. figure:: images/GDPR.png :class: with-shadow :scale: 70 *GDPR* * After filling in the GDPR form, you can select if you want to run and Automatic Machine Learning or a Custom Machine Learning * M - Machine learning (ML) * If in the previous step you have selected “Yes” for GDPR protection, you will find the ML option under the GDPR form. If you selected “No” you can select this option under this tab .. figure:: images/machine_learning.png :class: with-shadow :scale: 70 *Machine learning* Statistics =========== Statistics is the branch of mathematics concerned with collecting, analyzing, interpreting, presenting, and organizing data. It encompasses a wide range of techniques and methodologies used to make sense of data, including summarizing data through measures based on samples. On the statistics page, there are several top functions available, which offer: - Numerical Operations - Strings - Dates - Maths - Text - Data - Advanced - Boolean - Frequently used To navigate to the Statistics page from the profiling page, click on the next button or click the statistics icon. Main Screen ------------ .. figure:: images/mainscreen.png :class: with-shadow :scale: 90 *Main Screen* **A:** In the first row, all the possible data operations are displayed – sorted into eight different categories: Numerical, String, Dates, Maths, Text, Data, Advanced, Boolean, and Frequently used. Each tab will be discussed separately on the following pages. **B:** The first window in the statistics display showcases the fundamental statistics of the dataset, with the target highlighted in green, as indicated by the legend. **C:** Information we can get in this window is: - **Column name:** Shows the columns or variables that are present in the dataset. - **Value type:** Numerical, Categorical, Date, Text. - **Unique Values:** Number of unique values in the variable. - **Mean:** The average value of the variable. - **Median:** The middle value of the variable. - **Standard Deviation:** SD - Measures how the data is spread around the mean. - **Missing Values:** Number of missing values of the variable in the total count. - **% of Missing Values:** The number of missing values in the dataset is represented as a percentage. Upon clicking on a column name, the other charts dynamically adjust based on the selection. To perform operations with a specific column, simply click on the checkbox corresponding to that column and proceed the functions. **D:** Graphstat - The graphical display illustrates our chosen value – in this case, Sales is selected for demonstration purposes. In this tab, the distribution of sales can be observed. Additionally, options are available to view the display in full-screen mode, download the graphs in different formats, print the chart, or utilize data binning to divide the dataset into appropriate bins (which refers to grouping data into ranges of values). Furthermore, if a Scatter Plot is required, the option can be enabled. This will plot for the target variable and the selected variable. .. figure:: images/bins.png :class: with-shadow :scale: 100 *Bins* .. figure:: images/options.png :class: with-shadow :scale: 100 **E:** Transformations - Under this category, various available transformations can be applied to modify the data as needed. To perform transformations, refer to the syntax provided, and you can also clear, test, and apply the syntax. Additionally, view the Expression History. **F:** Preview - Displays preview of the whole dataset, top 50 rows **G:** Advanced statistics - In this tab, advanced statistics for the selected row are displayed; in this case, the 'Sales' row is selected. **H:** In this line, corresponding information from the advanced statistics is displayed: Minimum value, 5th percentile, First Quartile (25th percentile), Third Quartile (75th percentile), Interquartile Range (IQR), 95th percentile, Maximum value, Variance (measure of data dispersion), Skewness (measure of data symmetry), Kurtosis (measure of data peakiness or tails), Outliers (data points significantly distant from the rest), and the option to delete the Outliers. **J:** Variable of Importance - In this window, all the rows of the dataset are visible, sorted by their importance towards the selected target. In this case, Sales is selected as the most important factor, although it may not be accurate. The variable of importance can be refreshed using this icon (M). Additionally, a Chart Context Menu is available where you can view in full screen, print the chart, download the image as PNG, JPEG, PDF, SVG, and download the data as CSV and XLS. Advanced Statistics -------------------- Box Plot will display for Numerical variable. Hovering over the chart the following information will be displayed: **Upper Fence** - The maximum limit beyond which data points are considered outliers. **Upper Quartile** - The median of the upper half of the dataset, separating the highest 25% of values from the rest. **Median** - The middle value of the dataset when arranged in ascending order. **Lower Quartile** - The median of the lower half of the dataset, separating the lowest 25% of values from the rest. **Lower Fence** - The minimum limit beyond which data points are considered outliers. .. figure:: images/ADboxplot.png :class: with-shadow :scale: 90 When hovering over outliers, the outlier observations will be displayed. .. figure:: images/ADbpoutliers.png :class: with-shadow :scale: 90 Data Table ------------- For Categorical and Date variables, the Data Table will be displayed. .. figure:: images/ADcat.png :class: with-shadow :scale: 90 .. figure:: images/ADdate.png :class: with-shadow :scale: 90 Statistics- Numerical Operations ---------------------------------- .. figure:: images/Numericaloperations.png :class: with-shadow :scale: 90 *image Numerical Operations* Absolute Value ~~~~~~~~~~~~~~~~~~~ The operation converts a numerical value into its absolute value, exclusively accepting numerical inputs. For instance, temperature can be transformed into its absolute value to determine its distance from zero. Upon applying an operation, successful validation appear in the top-right corner. A newly generated column is displayed at the bottom of the preview. After column addition, the variable of importance may alter; to observe its impact. .. figure:: images/absoperation.png :class: with-shadow :scale: 90 *image showing message notification after ABS function (absolute)* Ceiling ~~~~~~~~~~~~~ Returns the closest integer greater than or equal to a given number, used for getting the nearest integer up Floor ~~~~~~~~ Returns the closest integer less than or equal to a given number, used for getting the nearest integer up Unit converter ~~~~~~~~~~~~~~~~~~~ Simplifies converting measurements between different units. For example, Converting inches to centimeters or pounds to kilograms using a unit converter. Numerical conversion to hexadecimal form ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Convert from a standard numerical format to hexadecimal system, where 16 symbols are used compared to 10 symbols used in the numerical system Rounding of values ~~~~~~~~~~~~~~~~~~~~~~~ Round numbers to a specified number of digits, such as rounding housing measurements to 2 digits. This function should allow for manual or automatic selection of the column name and if required, enable the user to manually choose the name of the new column. .. figure:: images/roundoff.png :class: with-shadow :scale: 90 *image round* Num –> Cat ~~~~~~~~~~~~~ Converting a numerical column to a categorical format. In case of missing values, they can be managed by either deletion or imputation, techniques elaborated upon in the forthcoming "Data" chapter. Upon applying this function, the successful validation will appear. .. figure:: images/Numtocat.png :class: with-shadow :scale: 90 Num format conversion ~~~~~~~~~~~~~~~~~~~~~~~~~ Helps in generating the numeric text of values in different languages. The value will be displayed in words in the chosen language. .. image:: images/Numformat.png :class: with-shadow :scale: 90 *Num format conversion* .. note:: Numerical operations can be performed only to the Numerical variable. Statistics- String ------------------ .. figure:: images/string.png :class: with-shadow :scale: 90 *String* Substring ~~~~~~~~~~ A Substring is a contiguous sequence of characters within a larger string. It is a portion of a string that is extracted or considered separately based on its starting position and length. Steps to Perform ^^^^^^^^^^^^^^^^^ - Select a column, then click on 'Substring' - Choose position (Right, Middle, Left, Exclude String, or Split by Character/Pattern) based on your requirement. - Define length and choose 'Auto' or 'User Defined' - Click on 'Apply' Right ^^^^^^ For the string "USA", the selected position is "Right" with a length of 1. This operation will return the last character of the string. The output will be: **A**. .. figure:: images/substringright.png :class: with-shadow :scale: 90 *Example for substring for the position - Right* Middle ^^^^^^^ For the string "houstan", the selected position is "Middle" with a start index 1 and length of 5. This operation will return the middle portion. For the middle position it is required to select the start index. The output will be: **houst**. .. figure:: images/substring_middle.png :class: with-shadow :scale: 90 *Example for substring for the position - Middle* Left ^^^^ For the string " New York", the selected position is "Left" with a length of 3. This operation will return the leftmost 3 characters. The output will be: **New**. .. figure:: images/substringleft.png :class: with-shadow :scale: 90 *Example for substring for the position - Left* Exclude String ^^^^^^^^^^^^^^ For the string "New York", the selected position is "Exclude String" with a length of 3 and a start index of 1. This operation will exclude the specified characters from the string. The output will be: **York**. .. figure:: images/substringexclude.png :class: with-shadow :scale: 90 *Example for substring for the position - Exclude String* Split by character/pattern ^^^^^^^^^^^^^^^^^^^^^^^^^^ For the string "Los Angels" and the selected character "s" with the position "after", the operation will split the string after each occurrence of the character "s". The output will be: **Angel**. .. figure:: images/substringsplit.png :class: with-shadow :scale: 90 *Example for substring for the position - Spelit by Character/Pattern* .. note:: For the position "Middle" and "Exclude String" it is required to select the start index. For the position "Split by character/pattern" it is required to choose either "Before" or "After. Replace ~~~~~~~ Replace one or more characters with another pattern/set of strings. Under 'Pattern which exist,' provide the pattern that needs to be changed. Then, under 'Pattern to be replaced with,' specify the pattern for replacement. To replace "UK" with "United Kingdom," provide the following: .. figure:: images/replaceimg.png :class: with-shadow :scale: 90 - Pattern which exists: "UK" - Pattern to be replaced with: "United Kingdom" The output will be: **Hi** .. figure:: images/replace.png :class: with-shadow :scale: 90 *Example for Replace* Replace range ~~~~~~~~~~~~~~ Replace set of strings based on its occurrence index value. specify the start and end indices of the substring to be replaced along with the replacement text. Example: Original word: "Chine" To replace a range of characters from index 1 to index 5 with "India" provide the following, - Start index: 1 - End index: 5 - String to be replaced: "India" .. figure:: images/rr.png :class: with-shadow :scale: 90 After completing the replacement operation, the updated word will be: **India** .. figure:: images/replacerange.png :class: with-shadow :scale: 90 *Example for ReplaceRange* .. note:: After performing the operation a new column will appear at the bottom. Delete range ~~~~~~~~~~~~ Deletes a set of characters based on the index values. To delete a range, specify the start and end indices, then click "Apply". Example: Original String: Los Angels To remove 'Ang' from "Los Angels", specify the indices as follows: - Start index: 4 - End index: 8 .. figure:: images/dr.png :class: with-shadow :scale: 90 After applying this deletion, the updated word will be **Losels**. .. figure:: images/deleterange.png :class: with-shadow :scale: 90 *Example for DeleteRange* Append ~~~~~~ Append a single or multiple character in each class/row. Example: - Original String: "Japan" - To append "ese" to the end of the word, simply add "ese" to it. .. figure:: images/ap.png :class: with-shadow :scale: 90 After appending, the updated word will be: **Japanese*. .. figure:: images/append.png :class: with-shadow :scale: 90 *Example for Append* Insert ~~~~~~~ Select the index or position where insertion is desired, then specify the value to be inserted. Example: Original string: "Phoenix" - Index to be inserted: 8 - Inserted value: "City" .. figure:: images/ins.png :class: with-shadow :scale: 90 After insertion, the updated string will be: **PhoenixCity**. .. figure:: images/insert.png :class: with-shadow :scale: 90 *Example for Insert* Repeat ~~~~~~~ To choose the number of times the selected column should be repeated. Example: - Original string: "UK" - To repeat "UK" two times, simply specify the repeat count as 2. After repeating, the updated string will be: **UKUK**. .. figure:: images/repeat.png :class: with-shadow :scale: 90 *Example for Repeat* Pop ~~~ Pop removes and returns the element at the specified index. Example: - Original string: "Chicago" - To remove 'o' from the string by specifying the index as 7. .. figure:: images/popimg.png :class: with-shadow :scale: 90 After the operation, the updated string will be: **Chicag**. .. figure:: images/pop.png :class: with-shadow :scale: 90 *Example for Pop* Reverse item ~~~~~~~~~~~~~ Iterate through the string in reverse order. Example: - Original string: "UK" - Reversed string: "KU" .. figure:: images/reverse.png :class: with-shadow :scale: 90 *Example for Reverse* Upper ~~~~~ Convert a column to upper case. "New York" will be be converted into "NEW YORK". .. figure:: images/upper.png :class: with-shadow :scale: 90 *Example for Upper* Lower ~~~~~ Convert a column to lower case "New York" will be be converted into "new york" .. figure:: images/lower.png :class: with-shadow :scale: 90 *Example for Lower* Find ~~~~~ To determine if an entry matches a searched string, compare each entry against the specified string. If a match is found, return the matched string. If no match is found, return False to indicate that the string is not present in the column. .. figure:: images/find.png :class: with-shadow :scale: 90 *Example for Find* Length ~~~~~~ Count the number of characters in the string. Example: - String: "Chicago" - Length of the string: 7 .. figure:: images/length.png :class: with-shadow :scale: 90 *Example for Length* Trim ~~~~~ Trimming/erasing of leading & trailing spaces from the string. .. figure:: images/trim.png :class: with-shadow :scale: 90 *Example for Trim* CountOccurence ~~~~~~~~~~~~~~~ Determines the number of times a specific substring appears. Steps to perform ^^^^^^^^^^^^^^^^^ - Select Substring to specify a substring pattern. - Select Regex to provide a regular expression pattern. .. figure:: images/co.png :class: with-shadow :scale: 90 Example: - String: "Japan" - Substring pattern: "a" .. figure:: images/cos.png :class: with-shadow :scale: 90 The output will be **2**. .. figure:: images/countoccurence.png :class: with-shadow :scale: 90 *Example for Count occurence* Extract ~~~~~~~ To extract specific information from a string. Select the Method such as: First word, Last word, Numbers extract. .. figure:: images/ext.png :class: with-shadow :scale: 90 **First word:** - Original String: New York - The output will be: **New** .. figure:: images/extract(first).png :class: with-shadow :scale: 90 *Example for Extract - First Word* **Last word:** - Original String: New York - The output will be: **York** .. figure:: images/extract(last).png :class: with-shadow :scale: 90 *Example for Extract - Last Word* **Numbers extract:** - Original String: Hello, I am 22 years Old now turning to 23 - The output will be: **2223** .. figure:: images/numextract.png :class: with-shadow :scale: 90 *Example for Numbers Extract* Split ~~~~~~ Splitting is the act of partitioning available data into portions, usually for cross-validatory purposes. Below are the methods of splitting. * HTTP string * Email * Invalid * URL .. figure:: images/split1.png :class: with-shadow :scale: 80 *image Split* Split Http string ^^^^^^^^^^^^^^^^^ This processor splits the elements of an HTTP query string. Split email ^^^^^^^^^^^ This processor splits an e-mail address into the local part and the domain. Split invalid ^^^^^^^^^^^^^^ This processor takes all values of a column that are invalid. Split URL ^^^^^^^^^^ This processor splits the elements of an URL into multiple columns. ExtractNum ~~~~~~~~~~ To extract numerical values from text This function is useful for isolating and retrieving numbers embedded within text, simplifying the process of data cleaning and preprocessing. Cat -> Num ~~~~~~~~~~~ Convert categorical variables into numerical values only when the selected categorical column contains numbers exclusively. Upon applying this function, the successful validation will appear. .. figure:: images/cattonum.png :class: with-shadow :scale: 90 .. note:: String Operations can be performed only to the String variables. Statistics- Dates ------------------- .. figure:: images/Stat-dates.png :class: with-shadow :scale: 100 Dates Extraction ~~~~~~~~~~~~~~~~~~ The Dates function with the menu, helps to strip out the date, month, Year, etc., from a date column. These stripped out date, month and year, act as additional feature for the model. Example: Lets try to extract Month from the "Date of Hire" column, .. figure:: images/extracteg.png :class: with-shadow :scale: 100 | .. figure:: images/extractoutput.png :class: with-shadow :scale: 100 Dates Pattern Conversion ~~~~~~~~~~~~~~~~~~~~~~~~~~ This function is used to convert the date pattern in a different, more suitable way for the research. After selecting the input and desired output pattern click on the apply button. Example: Lets convert the date pattern, from "yyyy-MM-dd HH:mm:ss" to "yyyy-MM-dd" .. figure:: images/dpconvert.png :class: with-shadow :scale: 100 This will covert the existing column with the selected date pattern. .. figure:: images/dpconvertoutput.png :class: with-shadow :scale: 100 There are several types of date patterns and shortcuts, that can be present in a date variable. Can Design own format patterns for dates and times from the list of symbols in the following table: Date Format Examples ^^^^^^^^^^^^^^^^^^^^^ +--------------+--------------------------------------------+-----------------------------+ | Date format | Description | Examples | +==============+============================================+=============================+ | EEEE | Day of the week | Monday, Tuesday | +--------------+--------------------------------------------+-----------------------------+ | EEE | Abbreviated day of the week | Mon, Tue, Wed | +--------------+--------------------------------------------+-----------------------------+ | MM | Month of the year in a two-digit format | 03, 07, 12 | +--------------+--------------------------------------------+-----------------------------+ | MMM | Abbreviated month of the year | Jan, Feb, Mar | +--------------+--------------------------------------------+-----------------------------+ | MMMM | Month of the year | January, February, March | +--------------+--------------------------------------------+-----------------------------+ | dd | Day of the month | 01, 15, 30 | +--------------+--------------------------------------------+-----------------------------+ | D | Day in the year | yyyy-DDD ---> 1999-081 | +--------------+--------------------------------------------+-----------------------------+ | yy | Year in two-digit format | 99, 12, 20 | +--------------+--------------------------------------------+-----------------------------+ | yyyy | Year in four-digit format | 1999, 2012, 2020 | +--------------+--------------------------------------------+-----------------------------+ | YYYY | Week-based year | 2009; 09 | +--------------+--------------------------------------------+-----------------------------+ | HH | Hour of the day (0-23) | 00, 12, 23 | +--------------+--------------------------------------------+-----------------------------+ | hh | Clock hour in AM/PM (1-12) format | 00, 05, 12 | +--------------+--------------------------------------------+-----------------------------+ | mm | Minute in hour | 00, 30, 59 | +--------------+--------------------------------------------+-----------------------------+ | ss | Second in minute | 00, 30, 59 | +--------------+--------------------------------------------+-----------------------------+ | SSS | Fraction of a second | 235, 512, 820 | +--------------+--------------------------------------------+-----------------------------+ | a | AM/PM marker | AM, PM | +--------------+--------------------------------------------+-----------------------------+ | G | Era designator | AD, BC | +--------------+--------------------------------------------+-----------------------------+ | u | Day number of week (1 = Mon,.., 7 = Sun) | 1, 2, 7 | +--------------+--------------------------------------------+-----------------------------+ | ww | Week in year | 02, 27, 52 | +--------------+--------------------------------------------+-----------------------------+ | W | Week in month | 1, 2, 4 | +--------------+--------------------------------------------+-----------------------------+ | z | General time zone | GMT, PST, SGT, CET | +--------------+--------------------------------------------+-----------------------------+ | Z |UTC time offset | +0000, -0800, +0100 | +--------------+--------------------------------------------+-----------------------------+ | XXX | ISO 8601 time zone | +01:00, -08:00 | +--------------+--------------------------------------------+-----------------------------+ | ’ | Escape for text | H' h 'mm z ---> 5 h 06 CET | +--------------+--------------------------------------------+-----------------------------+ Time pattern examples ^^^^^^^^^^^^^^^^^^^^^ +-------------------------------------------+--------------------------------------------------+ | Date and time pattern | Example | +===========================================+==================================================+ | yyyyMMddZ | 19990322+0100 | +-------------------------------------------+--------------------------------------------------+ | yyyyMMdd | 19990322 | +-------------------------------------------+--------------------------------------------------+ | yyyy/MM/dd H:mm | 1999/03/22 5:06 | +-------------------------------------------+--------------------------------------------------+ | yyyy.MM.dd HH:mm:ss | 1999.03.22 05:06:07 | +-------------------------------------------+--------------------------------------------------+ | yyyy.d.M HH:mm:ss | 1999.22.3 05:06:07 | +-------------------------------------------+--------------------------------------------------+ |yyyy-MM-dd | 1999-03-22 | +-------------------------------------------+--------------------------------------------------+ |yyyy-MM-ddXXX | 1999-03-22+01:00 | +-------------------------------------------+--------------------------------------------------+ | yyyy-MM-dd'T'HH:mm:ss.SSSXXX | 1999-03-22T05:06:07.000+01:00 | +-------------------------------------------+--------------------------------------------------+ |yyyy-MM-dd'T'HH:mm:ss.SSS'Z' | 1999-03-22T05:06:07.000Z | +-------------------------------------------+--------------------------------------------------+ | yyyy-MM-dd'T'HH:mm:ss.SSS | 1999-03-22T05:06:07.000 | +-------------------------------------------+--------------------------------------------------+ | yyyy-MM-dd'T'HH:mm:ss | 1999-03-22T05:06:07 | +-------------------------------------------+--------------------------------------------------+ | yyyy-MM-dd HH:mm:ss.S | 1999-03-22 05:06:07.0 | +-------------------------------------------+--------------------------------------------------+ | yyyy-MM-dd h:mm:ss a | 1999-03-22 5:06:07 AM | +-------------------------------------------+--------------------------------------------------+ | yyyy-MM-dd H:mm:ss | 1999-03-22 5:06:07 | +-------------------------------------------+--------------------------------------------------+ | yyyy-MM-dd G | 1999-03-22 AD | +-------------------------------------------+--------------------------------------------------+ | yyyy-M-d HH:mm:ss | 1999-3-22 05:06:07 | +-------------------------------------------+--------------------------------------------------+ | yyyy-M-d h:mm:ss a | 1999-3-22 5:06:07 AM | +-------------------------------------------+--------------------------------------------------+ | yyyy-DDDXXX | 1999-081+01:00 | +-------------------------------------------+--------------------------------------------------+ | yyyy MMMM dd E | 2019 July 24 Wed | +-------------------------------------------+--------------------------------------------------+ | yyyy-ww | 1999-02 | +-------------------------------------------+--------------------------------------------------+ | yyyy-ww-u | 2001-27-3 | +-------------------------------------------+--------------------------------------------------+ | yyyy-'W'ww-u | 2001-W27-3 | +-------------------------------------------+--------------------------------------------------+ | yy/MM/dd HH:mm | 99/03/22 05:06 | +-------------------------------------------+--------------------------------------------------+ | yy/MM/dd H:mm:ss | 99/03/22 5:06:07 | +-------------------------------------------+--------------------------------------------------+ | MMMM d, yyyy h:mm:ss z a | March 22, 1999 5:06:07 CET AM | +-------------------------------------------+--------------------------------------------------+ | MMM.dd.yyyy | Mar.22.1999 | +-------------------------------------------+--------------------------------------------------+ | MMM d, yyyy h:mm:ss a | Mar 22, 1999 5:06:07 AM | +-------------------------------------------+--------------------------------------------------+ | MM/dd/yyyy HH:mm:ss | 03/22/1999 05:06:07 | +-------------------------------------------+--------------------------------------------------+ | MM/dd/yyyy h:mm:ss a | 03/22/1999 5:06:07 AM | +-------------------------------------------+--------------------------------------------------+ | MM-dd-yyyy HH:mm:ss | 03-22-1999 05:06:07 | +-------------------------------------------+--------------------------------------------------+ | MM-dd-yyyy h:mm:ss a | 03-22-1999 5:06:07 AM | +-------------------------------------------+--------------------------------------------------+ | HH:mm:ss dd/MM/yyyy | 05:06:07 22/03/1999 | +-------------------------------------------+--------------------------------------------------+ |HH:mm:ss dd-MM-yyyy | 05:06:07 22-03-1999. | +-------------------------------------------+--------------------------------------------------+ | EEEE, MMMM d, yyyy h:mm:ss a z | Monday, March 22, 1999 5:06:07 AM CET. | +-------------------------------------------+--------------------------------------------------+ | EEEE, MMMM d, yyyy h:mm:ss 'o''clock' a z | Monday, March 22, 1999 5:06:07 o'clock AM CET. | +-------------------------------------------+--------------------------------------------------+ | EEEE, MMMM d, yyyy | Monday, March 22, 1999 | +-------------------------------------------+--------------------------------------------------+ | EEEE, d MMMM yyyy HH:mm:ss 'o''clock' z. | Monday, 22 March 1999 05:06:07 o'clock CET. | +-------------------------------------------+--------------------------------------------------+ | EEEE, d MMMM yyyy | Monday, 22 March 1999 | +-------------------------------------------+--------------------------------------------------+ | EEE, d MMM yyyy HH:mm:ss Z | Mon, 22 Mar 1999 05:06:07 +0100 | +-------------------------------------------+--------------------------------------------------+ | EEE MMM dd HH:mm:ss z yyyy | Mon Mar 22 05:06:07 CET 1999 | +-------------------------------------------+--------------------------------------------------+ | dd/MMM/yy h:mm a | 22/Mar/99 5:06 AM | +-------------------------------------------+--------------------------------------------------+ | dd/MM/yyyy HH:mm:ss | 22/03/1999 05:06:07 | +-------------------------------------------+--------------------------------------------------+ | dd/MM/yyyy h:mm:ss a | 22/03/1999 5:06:07 AM | +-------------------------------------------+--------------------------------------------------+ | dd/MM/yyyy H:mm | 22/03/1999 5:06 | +-------------------------------------------+--------------------------------------------------+ | dd.MM.yyyy. HH.mm.ss z | 22.03.1999. 05.06.07 CET | +-------------------------------------------+--------------------------------------------------+ | dd-MMM-yyyy HH:mm:ss | 22-Mar-1999 05:06:07 | +-------------------------------------------+--------------------------------------------------+ | dd-MM-yyyy HH:mm:ss | 22-03-1999 05:06:07 | +-------------------------------------------+--------------------------------------------------+ | dd-MM-yy HH:mm | 22-03-99 05:06 | +-------------------------------------------+--------------------------------------------------+ | dd MMMM yyyy HH:mm:ss z | 22 March 1999 05:06:07 CET | +-------------------------------------------+--------------------------------------------------+ Characters that are not letters are treated as quoted text. That is, they will appear in the formatted text even if they are not enclosed within single quotes. Delete Date Range ~~~~~~~~~~~~~~~~~ Removing or deleting all records or entries that fall within a specified range of dates from a dataset. Steps to perform: - Select "From date" and "To date" - Select Keep range (will keep only the selected range of dates) or Delete range (will delete only the selected range of dates) based on your preferences. Example: Lets keep only the records from "6/19/1961" To "8/17/1970" .. figure:: images/deldatrange.png :class: with-shadow :scale: 100 This will alter the existing column and will keep only the selected range of values. Flag Date Range ~~~~~~~~~~~~~~~~ Marking or highlighting all records or entries that fall within a specified range of dates. This is often done to identify, review, or take specific actions on the data within that range. Example: Lets try to flag from "5/30/2011" to "2/19/2013" .. figure:: images/flagdatrang.png :class: with-shadow :scale: 100 Date Difference ~~~~~~~~~~~~~~~~ Calculation of the amount of time between two dates. This can be measured in various units such as days, months, or years. Steps to perform: - select Minuend (The number from which another number (the subtrahend) is to be subtracted) and Subtrahend (The number that is to be subtracted from the minuend). - choose output time unit from the dropdown menu. .. figure:: images/dropdown.png :class: with-shadow :scale: 100 Example: Lets try to do difference for the columns 'Date of Hire" and "Date of Termination" .. figure:: images/input.png :class: with-shadow :scale: 100 This will generate a new column at the bottom of the preview. .. figure:: images/dd.png :class: with-shadow :scale: 100 Flag Holiday ~~~~~~~~~~~~~ Specific date refers to a day designated for celebrating or commemorating a national flag, often marked by various patriotic activities. Steps to perform: - Select flag type from the dropdown either holiday or weekend. - Select Country or Region. - Click on Apply Example: Lets flag the Holiday for Australia. .. figure:: images/fh.png :class: with-shadow :scale: 100 The output will be in the form of boolean (true or false). .. figure:: images/fgoutput.png :class: with-shadow :scale: 100 Statistics- Maths ------------------ .. figure:: images/maths.png :class: with-shadow :scale: 100 *Statistics - Maths* Sign conversion ~~~~~~~~~~~~~~~~~~~~ Convert a positive to negative sign and vice versa. Choose between positive or negative conversion, and the result will be displayed in the existing column. Before applying sign conversion, .. figure:: images/originalsign.png :class: with-shadow :scale: 100 *Example for before applying sign conversion* After applying sign conversion, .. figure:: images/negativesign.png :class: with-shadow :scale: 100 *Example for after applying negative sign conversion* .. note:: This function will not generate a new column for the output. It will modify the original column. Binning ~~~~~~~~~~ Select Binning Type such as: Interval binning, Equal width and Custom Binning. Interval binning ^^^^^^^^^^^^^^^^^ Interval binning involves dividing the range of values into equal-sized intervals. For example, if the data ranging from 11 to 100 and want 5 bins, .. figure:: images/intervalbinning.png :class: with-shadow :scale: 100 *Example for Interval Binning* Equal width ^^^^^^^^^^^^ Equal width binning is a type of interval binning where each bin has the same width. For example, if the data representing the heights of students in a class ranging from 120 cm to 180 cm, and need to create 4 bins, .. figure:: images/equalwidthbinning.png :class: with-shadow :scale: 100 *Example for Equal Width Binning* Custom Binning ^^^^^^^^^^^^^^^ Custom binning involves defining bins based on specific criteria or requirements. Addition ~~~~~~~~~~~~ Add multiple columns. Atleast two columns are required. .. figure:: images/add.png :class: with-shadow :scale: 100 *Example for Addition of two columns* Subtraction ~~~~~~~~~~~~~~~ Subtract multiple columns. .. figure:: images/sub.png :class: with-shadow :scale: 100 *Example for Subtraction of two columns* Multiplication ~~~~~~~~~~~~~~~~~~~ Multiply multiple columns. .. figure:: images/multiplication.png :class: with-shadow :scale: 100 *Example for Multiplication of two columns* Division ~~~~~~~~~~~~~ Divide multiple columns. Its a reverse function from multiplication. .. figure:: images/division.png :class: with-shadow :scale: 100 *Example for Division* .. note:: The functions (D, E, F) are designed to operate sequentially from left to right. Remainder ~~~~~~~~~~~~~~ The remainder is the integer left over after division, indicating the part that remains unallocated. .. figure:: images/Remainder.png :class: with-shadow :scale: 100 *Example for Remainder* Data transformation ~~~~~~~~~~~~~~~~~~~~~~~~ This function includes several mathematical operations - *Box cox*, *Yeo Johnson*, *Cubic root*, *Exponential*, *Log*, *Normalization*, *Power2*, *Power3*, *Reciprocal*, *Square root* and *Standardization*. After selecting your chosen operation a new column is created at the bottom. Box Cox ^^^^^^^ The Box-Cox transformation is a statistical method used to stabilize variance and make data more normally distributed (symmetric bell-shaped distribution where the mean, median, and mode are all equal) This dataset is skewed to the right, indicating that most of the house prices are lower. To apply the Box-Cox transformation mathematically, Choose a range of values for the λ (lambda) parameter to test the transformation. Let's say we try λ values ranging from -2 to 2. For each λ value, calculate the transformed values using the Box-Cox transformation formula: .. math:: y_i = \begin{cases} \frac{x_i^\lambda - 1}{\lambda}, & \text{if } \lambda \neq 0 \\ \ln(x_i), & \text{if } \lambda = 0 \end{cases} :math:`y_i` represents the transformed value and :math:`x_i` represents the original value. Example: Consider the dataset: X={1200,2000,4500,800,3500,3000,1500,2500,4000,1000} For λ ≠ 0 λ=0.2784475681948208: .. math:: y_1 &= \frac{1200^{0.2784475681948208} - 1}{0.2784475681948208} \approx 22.26977 \\ y_2 &= \frac{2000^{0.2784475681948208} - 1}{0.2784475681948208} \approx 26.22267 \\ y_3 &= \frac{4500^{0.2784475681948208} - 1}{0.2784475681948208} \approx 33.77536 \\ y_4 &= \frac{800^{0.2784475681948208} - 1}{0.2784475681948208} \approx 19.50882 \\ y_5 &= \frac{3500^{0.2784475681948208} - 1}{0.2784475681948208} \approx 31.24991 \\ y_6 &= \frac{3000^{0.2784475681948208} - 1}{0.2784475681948208} \approx 29.78606 \\ y_7 &= \frac{1500^{0.2784475681948208} - 1}{0.2784475681948208} \approx 23.92758 \\ y_8 &= \frac{2500^{0.2784475681948208} - 1}{0.2784475681948208} \approx 28.13388 \\ y_9 &= \frac{4000^{0.2784475681948208} - 1}{0.2784475681948208} \approx 32.56974 \\ y_10 &= \frac{1000^{0.2784475681948208} - 1}{0.2784475681948208} \approx 20.98964 \\ .. note:: The lambda (λ), which varies from -5 to 5. All values of λ are considered and the optimal value for your data is selected; The “optimal value” is the one which results in the best approximation of a normal distribution curve. .. figure:: images/boxcox.png :class: with-shadow :scale: 100 *Example for Box Cox Transformation* Yeo Johnson ^^^^^^^^^^^^ The Yeo-Johnson transformation extends the Box-Cox transformation to handle both positive and negative values, as well as zero values. It is useful for stabilizing variance and making data more normally distributed. .. math:: y_i = \begin{cases} \frac{(x_i + 1)^{\lambda_i} - 1}{\lambda_i}, & \text{if } \lambda_i \neq 0, x_i \geq 0 \\ \ln(x_i + 1), & \text{if } \lambda_i = 0, x_i \geq 0 \\ -\frac{(|x_i| + 1)^{2 - \lambda_i} - 1}{2 - \lambda_i}, & \text{if } \lambda_i \neq 2, x_i < 0 \\ -\ln(|x_i| + 1), & \text{if } \lambda_i = 2, x_i < 0 \end{cases} where: - :math:`x_i` is the original value, - :math:`y_i` is the transformed value, - :math:`λ_i` is a parameter that varies for each feature. Example: Let's consider x=3 and lambda (λ) = 0.7995651776710038, .. math:: \begin{align*} x'_1 &= ((3 + 1)^{0.7995651776710038} - 1) / 0.7995651776710038 \\ &= 2.389922907892544 \\ x=6 and lambda (λ) = 0.7995651776710038, .. math:: \begin{align*} x'_1 &= ((3 + 1)^{0.7995651776710038} - 1) / 0.7995651776710038 \\ &= 4.191151663607149 \\ .. note:: Factors influencing the lambda value include the data's skewness, distribution, variance, sensitivity to outliers and optimization method. The lambda value is empirically determined to maximize the normality and symmetry of the transformed data. | .. figure:: images/yeojohnson.png :class: with-shadow :scale: 100 *Example for Yeo Johnson Transformation* Cubic Root ^^^^^^^^^^^ The value when multiplied by itself twice, yields the original number is referred to as the cubic root. For instance, the cubic root of 8 is 2. .. math:: y_i = \sqrt[3]{x_i} where: - :math:`x_i` is the original value, - :math:`y_i` is the transformed value. Example: .. math:: \sqrt[3]{27} = 27^{1/3} = 3.0 .. math:: \sqrt[3]{64} = 64^{1/3} = 4.0 | .. figure:: images/cubicroot.png :class: with-shadow :scale: 100 *Example for Cubic Root* Exponential ^^^^^^^^^^^^ Exponential refers to a mathematical operation or function where a constant (the base) is raised to the power of an exponent, resulting in a rapidly increasing or decreasing function. It is commonly represented as :math:`a^x`, where a is the base and x is the exponent. An example of an exponential function is :math:`f(x) = 2^x`. Let's evaluate this function for some values of :math:`x`: - When :math:`x = 0`, :math:`f(0) = 2^0 = 1`. - When :math:`x = 1`, :math:`f(1) = 2^1 = 2`. .. figure:: images/exponential.png :class: with-shadow :scale: 100 *Example for Exponential Transformation* Log ^^^ The log transformation is a mathematical operation that calculates the logarithm of a number. It is commonly used to transform skewed data to make it more symmetrically distributed. The transformation formula is as follows: .. math:: y_i = \log(x_i) where: - :math:`x_i` is the original value, - :math:`y_i` is the transformed value. Example: .. math:: \log_{10}(100) = \frac{\ln(100)}{\ln(10)} We can substitute :math:`\ln(100)` and :math:`\ln(10)` with their respective values: .. math:: \ln(100) = \ln(e^4) = 4\ln(e) \ln(10) \approx 2.30258 Finally: .. math:: \log_{10}(100) \approx \frac{4\ln(e)}{2.30258} Since :math:`\ln(e) = 1`, we have: .. math:: \log_{10}(100) \approx \frac{4}{2.30258} \approx 1.732 Therefore, :math:`\log_{10}(100)` is approximately 1.732. .. math:: \log_2(8) = \frac{\ln(8)}{\ln(2)} We can substitute :math:`\ln(8)` and :math:`\ln(2)` with their respective values: .. math:: \ln(8) = \ln(e^2.07944) \approx 2.07944 \ln(2) \approx 0.69315 Finally: .. math:: \log_2(8) \approx \frac{2.07944}{0.69315} This simplifies to: .. math:: \log_2(8) \approx 3 Therefore, :math:`\log_2(8)` is equal to 3. .. figure:: images/log.png :class: with-shadow :scale: 100 *Example for Log Transformation* Normalization ^^^^^^^^^^^^^^ Normalization is a technique used to scale numeric features to a standard range, typically between 0 and 1 or between -1 and 1. It is used statistical analyses to ensure that all features have the same scale and to prevent features with larger values from dominating those with smaller values. For Example, Consider the following dataset, :math:`{Data} = [2, 5, 8, 10, 15]` To normalize these values using min-max scaling, we'll use the formula: .. math:: x_{\text{norm}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}} Where: - :math:`x_min= 2` is the minimum value of the dataset, - :math:`x_max = 15` is the maximum value of the dataset. Now, we can calculate the normalized values using the formula: For x = 2: :math:`x_{\text{norm}} = \frac{2 - 2}{15 - 2} = 0` For x = 5: :math:`x_{\text{norm}} = \frac{5 - 2}{15 - 2} = \frac{3}{13} \approx 0.2308` For x = 8: :math:`x_{\text{norm}} = \frac{8 - 2}{15 - 2} = \frac{6}{13} \approx 0.4615` For x = 10: :math:`x_{\text{norm}} = \frac{10 - 2}{15 - 2} = \frac{8}{13} \approx 0.6154` For x = 15: :math:`x_{\text{norm}} = \frac{15 - 2}{15 - 2} = 1` So, after normalizing the data using min-max scaling, the normalized dataset would be: Normalized Data = [0, 0.2308, 0.4615, 0.6154, 1] These values are now scaled to the range between 0 and 1. .. figure:: images/normalization.png :class: with-shadow :scale: 100 *Example for Normalization* Power2 ^^^^^^^ The term "Power of 2" typically refers to the operation of exponentiation, where a number is raised to the second power. In mathematical notation, it is represented as :math:`"x^2"` where, x is the base number. This operation involves multiplying the base number by itself. For example: .. math:: 2^{2} = 4 In general, raising a number to the power of 2 is known as squaring that number. .. figure:: images/power2.png :class: with-shadow :scale: 100 *Example for Power 2* Power3 ^^^^^^^ The term "power of 3" also refers to exponentiation, where a number is raised to the third power. In mathematical notation, it is represented as :math:`"x^3"` where, x is the base number. This operation involves multiplying the base number by itself twice. For example: .. math:: 1200^{3} = 1,728,000,000 In general, raising a number to the power of 3 is known as cubing that number. .. figure:: images/power3.png :class: with-shadow :scale: 100 *Example for Power 3* Reciprocal ^^^^^^^^^^^ The reciprocal of a number is calculated by dividing 1 by that number. Mathematically, if we have a number represented as :math:`x`, the reciprocal of :math:`x` is represented as :math:`\frac{1}{x}`. For example: :math:`x = 5`, the reciprocal of 5 = 1/5. Similarly, :math:`x = 0.5`, the reciprocal of 0.5 is 1/0.5 = 2. The reciprocal of a number essentially represents how many times that number can fit into 1. For instance, the reciprocal of 5 is (1/5) means that 5 fits into 1 one-fifth of a time. .. figure:: images/reciprocal.png :class: with-shadow :scale: 100 *Example for Reciprocal* Square Root ^^^^^^^^^^^^ The square root of a number is a value that, when multiplied by itself, gives the original number. In mathematical notation, the square root of a number x is represented as √x. For example: :math:`\sqrt{25}` = 5. Since 5 * 5 = 25 The square root operation can also be denoted using exponentiation: x^(1/2). This means raising x to the power of one-half, which is equivalent to taking the square root of x. .. figure:: images/squareroot.png :class: with-shadow :scale: 100 *Example for Square Root* Standardization ^^^^^^^^^^^^^^^ In statistics, standardization is a process used to transform data so that it has a mean of zero and a standard deviation of one. This technique is applied to individual variables within a dataset, making them comparable and facilitating statistical analysis. The formula for standardizing a variable x is: .. math:: z = \frac{x - \mu}{\sigma} Where: - :math:`x` is the original value of the variable, - :math:`\mu` is the mean of the variable, - :math:`\sigma` is the standard deviation of the variable, and - :math:`z` is the standardized value. Standardization transforms the distribution of the data to have a mean of zero and a standard deviation of one. This process does not change the shape of the distribution but places it on a common scale, allowing for easier comparison between variables and datasets. Standardization is particularly useful in statistical modeling, machine learning, and data analysis, where it helps improve the interpretability and performance of models by ensuring that variables are on a consistent scale. Example: Consider the data x= [5, 20, 8, 3, 4, 6, 15, 12, 9, 10] Mean :math:`\bar{x}` = :math:`\frac{{5 + 20 + 8 + 3 + 4 + 6 + 15 + 12 + 9 + 10}}{{10}}` = 8.2 \ Standard Deviation :math:`\sigma` = :math:`\sqrt{\frac{{\sum_{i=1}^{10} (x_i - \bar{x})^2}}{{10}}}` = 5.03 \ Standardized Data :math:`z_i` = :math:`\frac{{x_i - \bar{x}}}{{\sigma}}` where :math:`x_i` is each data point in the given data. .. math:: z_1 &= \frac{5 - 8.2}{5.03} \approx -0.64 \\ z_2 &= \frac{20 - 8.2}{5.03} \approx 2.35 \\ z_3 &= \frac{8 - 8.2}{5.03} \approx -0.04 \\ z_4 &= \frac{3 - 8.2}{5.03} \approx -1.03 \\ z_5 &= \frac{4 - 8.2}{5.03} \approx -0.83 \\ z_6 &= \frac{6 - 8.2}{5.03} \approx -0.43 \\ z_7 &= \frac{15 - 8.2}{5.03} \approx 1.35 \\ z_8 &= \frac{12 - 8.2}{5.03} \approx 0.76 \\ z_9 &= \frac{9 - 8.2}{5.03} \approx 0.16 \\ z_{10} &= \frac{10 - 8.2}{5.03} \approx 0.36 .. note:: Mathematical operations can be performed only for the Numerical variables. Statistics - Text ------------------ Text Processing ~~~~~~~~~~~~~~~~~ Text processing is used to analyze most used words in a written text. Can also be used to identify SPAM emails, by searching for specific key words. By doing the text processing you will get the top 100 used words as new columns. Select the preferred N grams from 1 to 5. Select the Encode type such as : Boolean( 0 or 1), Count. Select Custom stop words if required. Click on apply button. .. note:: The output will appear at the bottom of the page. You can also preview the output on the preview page. | .. figure:: images/textprocessing.png :class: with-shadow :scale: 100 *Example for Text Processing* Text->Cat ~~~~~~~~~~~ Converts Text data type into Categorical data type. Text data type before conversion, .. figure:: images/beforeconversion.png :class: with-shadow :scale: 100 *Example for Text->Cat before conversion* After conversion, .. figure:: images/afterconversion.png :class: with-shadow :scale: 100 *Example for Text->Cat after conversion* .. note:: This operation will not generate a new column for the output; instead, it will only modify the data type within the same column, altering the output accordingly. .. figure:: images/output.png :class: with-shadow :scale: 100 .. note:: Text functions can be performed only for the Text variable. Statistics- Data ----------------- Label encoding ~~~~~~~~~~~~~~~ Label encoding is a technique used to convert categorical variables into numerical format. Consider a dataset with a categorical variable "Temperature" and its corresponding values are "Hot," "Cold," and "Warm." Using label encoding, numerical labels are assigned to each unique category. The labels are assigned based on alphabetical order. .. note:: This function applicable only for the Categorical column. | .. figure:: images/labelencoding.png :class: with-shadow :scale: 100 *Example for Label Encoding* One Hot Encoding ~~~~~~~~~~~~~~~~~~~ One hot encoding is another method used to represent categorical variables in a format suitable for machine learning algorithms. Instead of assigning numerical labels like label encoding, one hot encoding creates binary vectors for each category. .. note:: This function is applicable for both categorical and text columns. It does not require a user-defined output column name; instead, it will automatically generate a column name for the output. .. figure:: images/onehotencoding.png :class: with-shadow :scale: 100 *Example for One Hot Encoding* Renaming Factor Level ~~~~~~~~~~~~~~~~~~~~~~ Changing the labels or names of categories in a categorical variable to make them more descriptive or meaningful. .. note:: This Function applicable for all data types. .. figure:: images/renamingfactor.png :class: with-shadow :scale: 100 *Example for Renaming Factor* .. figure:: images/validation.png :class: with-shadow :scale: 100 Ignore ~~~~~~~~~ Refers to excluding Certain variables to filtering out irrelevant or noisy data. .. note:: Multiple columns can also be ignored. .. figure:: images/ignore.png :class: with-shadow :scale: 100 *Example for Ignore* Unignore ~~~~~~~~~ Reverse function of ignore (Ignored variables can be undo by using this function). Unignoring the ignored variables. .. figure:: images/unignore.png :class: with-shadow :scale: 100 *Example for Unignore* Impute Missing ~~~~~~~~~~~~~~~~~ Involves filling in missing data points in a dataset. Techniques available for Numerical Data Type include: - Mean (Often known as the average, represents the central tendency of a dataset. It's computed by adding up all values and dividing by the total count of values.) - Median (Middle value in a dataset when the values are arranged in ascending or descending order. If there's an even number of values, it's the average of the two middle values.) - Mode (Value that appears most frequently in a dataset. It's the most common observation or value in the data.) - Default - Moving Average (Smooths out fluctuations in data by calculating the average of a subset of data points within a sliding window.) - Forward Filling (Copying the last observed value forward to fill in missing values.) - Backward Filling (Copying the next observed value backward to fill in missing values in a dataset.) - Custom Filling (Filling missing values using a user-defined method.) - Average of previous and next (Involves taking the average of the values immediately preceding and following the missing value to estimate its value.) - KNN (Missing values are replaced with the average or median value of the k nearest neighbors.) The available techniques for "Categorical" variable are, - Mode - Forward Filling - Backward Filling - Custom Filling - KNN .. figure:: images/categorical.png :class: with-shadow :scale: 100 *Techniques for categorical variable* The available technique for "Date" variable are, - Mode - Forward Filling - Backward Filling .. figure:: images/categorical.png :class: with-shadow :scale: 100 *Techniques for Numerical variable* The available technique for "Text" variable are, - Mode - Forward Filling - Backward Filling - Custom Filling - KNN | .. figure:: images/imputemissing.png :class: with-shadow :scale: 100 *Example for Impute Missing* Delete Missing ~~~~~~~~~~~~~~~ Deleting rows with missing values can be a preferred approach when those values are critical for analysis and cannot be easily imputed. This function removes all rows containing missing values across all columns in the dataset, ensuring that only complete observations are retained for analysis. .. figure:: images/deletemissing.png :class: with-shadow :scale: 100 *Example for Delete Missing* Copy ~~~~~ Copying a column or creating a duplicate allows for performing two different operations on the same column, which can be useful. This process simply involves creating a copy of the column, and then proceeding with the desired operations. The new column is created at the bottom. .. figure:: images/copy.png :class: with-shadow :scale: 100 *Example for Copy* Column Concat ~~~~~~~~~~~~~~ The function is used to concatenate two or more columns into one column. For instance, if there are columns for "First name" and "Surname," they can be merged into a single column. For example, "Name" and "Sex" columns are being combined. A delimiter is manually selected to clearly mark the transition between the values of the first and second columns. .. figure:: images/columnconcat.png :class: with-shadow :scale: 100 *Example for Column Concat* Cross Tab ~~~~~~~~~~ Cross tabulation is a method to organize and summarize data from two categorical variables into a table. It helps to see how these variables are related by counting the occurrences of each combination of categories. .. figure:: images/crosstab.png :class: with-shadow :scale: 100 *Example for Cross Tab* To transpose the output, click on the 'inverse' option .. figure:: images/crosstabinverse.png :class: with-shadow :scale: 100 *Example for Cross Tab Inverse* Filter ~~~~~~~~~~ Filter the data using multiple conditions. Available Filters for the Data types are, Filters for Numerical Data Type Condition ^^^^^^^^^^ Numerical data often involves conditions such as: **Equal to:** Filters data where the numerical value is exactly equal to a specified value. For example, filtering sales data for transactions where the amount equals $1000. **Not Equal to:** Filters data where the numerical value is not equal to a specified value. For instance, excluding transactions with a sales amount of $1000. **Is Greater Than:** Filters data where the numerical value is greater than a specified threshold. For example, filtering sales data for transactions with amounts greater than $1000. **Is Greater Than or Equal to:** Filters data where the numerical value is either greater than or equal to a specified threshold. For instance, filtering sales data for transactions with amounts greater than or equal to $1000. **Is Less Than:** Filters data where the numerical value is less than a specified threshold. For example, filtering sales data for transactions with amounts less than $1000. **Is Less Than or Equal to:** Filters data where the numerical value is either less than or equal to a specified threshold. For instance, filtering sales data for transactions with amounts less than or equal to $1000. **Is Null:** Filters data where the numerical value is null or missing. **Is Not Null:** Filters data where the numerical value is not null or missing. **Between:** Filters data where the numerical value falls within a specified range. For example, filtering sales data for transactions with amounts between $500 and $1000. .. note:: This filter is applicable to Categorical and Text data types. | .. figure:: images/filter.png :class: with-shadow :scale: 100 *Example for Filter* For adding more rules and conditions, click on "Add a Group" and choose between an "and" or "or" option for each step. Once the "Apply" button is clicked, the entire dataset will be modified to include only those entries that meet the specified filter criteria. .. figure:: images/addrule.png :class: with-shadow :scale: 100 *Example for Add Rule and Add Group* Grouped Statistics ~~~~~~~~~~~~~~~~~~~ Grouped statistics organize data into groups and calculate statistical measures for each group, offering detailed insights into how variables behave within specific categories. The available aggregation are: - Mean (Often known as the average, represents the central tendency of a dataset. It's computed by adding up all values and dividing by the total count of values.) - Median (Middle value in a dataset when the values are arranged in ascending or descending order. If there's an even number of values, it's the average of the two middle values.) - Variance (Value that appears most frequently in a dataset. It's the most common observation or value in the data.) - Standard Deviation (How the data is spread around the mean) - Unique (Number of unique values in the dataset) - Count (Total number of elements present in a particular variable or column) - Sum (Total value obtained by adding together all the individual values in a specific variable) Example: To calculate average booking amount for each destination city, select INR_Amount with, .. figure:: images/groupstat.png :class: with-shadow :scale: 100 .. figure:: images/groupedstat.png :class: with-shadow :scale: 100 *Example for Grouped Statistics* Transpose ~~~~~~~~~~ Transpose refers to the operation of flipping or rotating a dataset, typically switching its rows and columns. This operation can be useful for various purposes, such as changing the orientation of data for easier analysis or visualization, or preparing data for specific computations or algorithms. When a dataset is transposed, the rows become columns and vice versa. Transpose Methods: - Split & Transpose - Combine & Transpose Split & Transpose ^^^^^^^^^^^^^^^^^^ Dividing a text string based on a separator and then rearranging the resulting parts into columns or rows. .. note:: This functionality will work only for Categorical variable(s). | .. figure:: images/seperator.png :class: with-shadow :scale: 100 .. figure:: images/split&transpose.png :class: with-shadow :scale: 100 *Example for Split & Transpose* Combine & Transpose ^^^^^^^^^^^^^^^^^^^^ Merging multiple columns into one column, and then rearranging the data from rows to columns or vice versa. .. figure:: images/transpose.png :class: with-shadow :scale: 100 .. figure:: images/combine&transpose.png :class: with-shadow :scale: 100 *Example for Combine & Transpose* Drop Duplicates(Row Wise) ~~~~~~~~~~~~~~~~~~~~~~~~~ Dropping duplicates refers to removing Duplicates in column or entire row duplicates. This operation ensures that each row in the dataset is unique, which can be important for various data analysis tasks In the selected Column ( from_city) contains 2033 missing records. .. figure:: images/duplicatecount.png :class: with-shadow :scale: 100 .. figure:: images/dropduplicates.png :class: with-shadow :scale: 100 After applying the function to the selected column, the validation results will be displayed. .. figure:: images/successvalidation.png :class: with-shadow :scale: 100 | .. figure:: images/result.png :class: with-shadow :scale: 100 *Result* Statistics- Advanced --------------------- Pivot analysis ~~~~~~~~~~~~~~~~ Analysis based on pivoting data - a Pivot Table is used to summarise, sort, reorganise, group, count, total or average data stored in a table. It allows us to transform columns into rows and rows into columns. We have two options in this section - either a “Pivot mode” or a regular chart mode. Regular chart mode ^^^^^^^^^^^^^^^^^^^ In the regular chart mode we can select all the rows that we are interested in, in our example we selected Age, Fare, Sex, Ticket purchase, Cabin Detail and Number of meals. We get a basic chart with the selected values. .. figure:: images/RegularModeSelect.png :class: with-shadow :scale: 100 *RegularModeSelect* In the next step, we can also divide our selected columns in groups, by dragging the preferred group type down to the “Row groups” section. We have selected “Sex” to differentiate the groups. We can see we have 577 items in group “Male” and 314 items in group “Female”. We can open each group by clicking on the group name. .. figure:: images/RegularModeRowGroups.png :class: with-shadow :scale: 100 *RegularModeRowGroups* In the next step, we can add some other numerical values that we are interested in. In our case, we dragged the “Fare” down bellow to the “Values” section and by clicking on the desired value in the bottom green box. We can also select in which form the value should be displayed - *average, count, first, last, max, min* or *sum.* In our case it was the “average fare” value. .. figure:: images/RegularModeValues.png :class: with-shadow :scale: 100 *RegularModeValues* Pivot mode ^^^^^^^^^^^ Firstly it is necessary to switch to the “Pivot mode” in the top right part of the screen .. figure:: images/PivotModeOn.png :class: with-shadow :scale: 100 *PivotModeOn* After that, select the fields you want to have in your Pivot table. For our demonstration we selected - Survived, Age, Sex, Ticket purchase and Meal Cost. “Ticket purchase” was dragged to the “Row Groups”, so it was automatically shifted to the left side as the first column. We also shifted “Sex” to the “Column Labels” section, so our chart was split into two sections - one section showing only the “Female” data and second showing only the “Male” data. Rest of our selected fields (Survived, Age and Meal cost) remain in the “Values” section, where we can choose again in which form each field should be displayed (*average, count, first, last, max, min* or *sum.* In our case it was the “average fare”). .. figure:: images/PivotMode.png :class: with-shadow :scale: 100 *PivotMode* More tools are hidden next to each value column as per the bellow image .. figure:: images/ToolsButton.png :class: with-shadow :scale: 100 *ToolsButton* .. figure:: images/MoreTools.png :class: with-shadow :scale: 100 You can also export or copy the chart with the right click on the table. .. figure:: images/RightClick.png :class: with-shadow :scale: 100 Jupyter Notebook ~~~~~~~~~~~~~~~~~ Jupyter_ is an open-source, interactive web tool known as a computational notebook, which researchers can use to combine software code, computational output, explanatory text and multimedia resources in a single document. User can create a customised program/code using Jupyter notebook and import it back to EDGE. .. _Jupyter: https://jupyter.org/ Regular Expression ~~~~~~~~~~~~~~~~~~ Regular Expressions (REs) offer a mechanism to select specific strings from a set of character strings. They provide a context-independent syntax capable of representing a wide variety of character sets and their orderings, with interpretations based on the current locale. In addition to using Search and Replace for finding and replacing simple text strings, you can perform more precise searches using Regular Expressions (RegEx). RegEx allows you to accurately match a particular pattern in your search. Select the Method such as : - Extract - Search - Replace .. figure:: images/extract.png :class: with-shadow :scale: 100 *Extract* Example: Lets try to extract '_', .. figure:: images/regextract.png :class: with-shadow :scale: 100 *Example for Extract* Example: Lets try to search 'Chennai', .. figure:: images/search.png :class: with-shadow :scale: 100 .. figure:: images/regsearch.png :class: with-shadow :scale: 100 *Example for Search* Example: Lets replace 'Bālāpur' with 'Balapur' .. figure:: images/regreplace.png :class: with-shadow :scale: 100 *Example for Replace* Delta ~~~~~~ Delta (Δ) is a generalized function that calculates the difference between each row value and the value from its corresponding subtrahend position in the given numerical column grouped & sorted through the groupBy & orderBy columns respectively. .. figure:: images/delta.png :class: with-shadow :scale: 100 .. figure:: images/egdelta.png :class: with-shadow :scale: 100 *Example for Delta* Statistics- Boolean ------------------- Refers to a data type that represents one of two states: true or false. Booleans are commonly used to filter, subset, or mask data based on certain conditions. Is Null ~~~~~~~~ Refers to missing or nonexistent values in a dataset. When checking for null values in a specific data field or observation, the result is typically a boolean (true or false) indicating whether the value is null (true) or not null (false). This process of checking for null values is essential for data cleaning and quality assessment before analysis. .. figure:: images/null.png :class: with-shadow :scale: 100 *Example for Is Null* Is Even ~~~~~~~~ Determining whether a given number is divisible by 2 without any remainder, indicating that it is an even number. It will return true if the number is even and false if it is not. .. figure:: images/even.png :class: with-shadow :scale: 100 *Example for Is Even* Is Odd ~~~~~~~~~~ Determining whether a given number is not divisible by 2 without any remainder, indicating that it is an odd number. It will return true if the number is odd and false if it is not. .. note:: This function only applicable to Numerical Variables. | .. figure:: images/odd.png :class: with-shadow :scale: 100 *Example for Is Odd* Is Number ~~~~~~~~~~ Checks whether a given value is a numerical type. It will return true if the value is a number and false if it is not. .. figure:: images/number.png :class: with-shadow :scale: 100 *Example for Is Number* Is Text ~~~~~~~~ Checks whether a given value is a string or textual data type. It will return true if the value is a number and false if it is not. .. figure:: images/egtext.png :class: with-shadow :scale: 100 *Example for Is Text* Negate Boolean ~~~~~~~~~~~~~~~ Means flipping its truth value. For example, if the original boolean value is true, its negation would be false, and vice versa. It's a way to change a statement from being true to false, or vice versa. .. note:: This function applicable only to the Boolean values. | When selecting non boolean data type variable it will show the validation called "Selected column has no boolean values. .. figure:: images/validation_negate.png :class: with-shadow :scale: 100 .. figure:: images/negate.png :class: with-shadow :scale: 100 *Example for Negate Boolean* Frequently Used ---------------- The functions commonly used for analysis are: - Label Encoding - One Hot Encoding - Ignore - unignore - Impute Missing - Extract Date Parts - Data Transform - Binning Graphstat ---------- This feature presents a bar chart for the selected column, displaying the value and its count. It supports numerical, categorical, and date data types for bar chart visualization. Numerical data types offer an option to adjust the number of Bins. Text data types are represented using a word cloud. Scatter plots can be enabled to visualize two numerical variables against each other, typically the selected variable and the target variable. Users can view charts in full-screen mode and utilize a print option. Chart download options include PNG, JPEG, PDF, and SVG formats. Data can be downloaded in CSV or XLS format. .. note:: Categorical and date data types display only bar charts, while numerical data types offer both scatter plots and bar charts. For categorical variables, the 'sort by' feature is available. .. figure:: images/sortby.png :class: with-shadow :scale: 100 .. figure:: images/cat.png :class: with-shadow :scale: 100 *Graphstat for Categorical Column* .. figure:: images/date.png :class: with-shadow :scale: 100 *Graphstat for Date Column* .. figure:: images/numeric.png :class: with-shadow :scale: 100 *Graphstat for Numerical Column* If Scatter Plot is required, enable the Scatter with Target variable toggle. .. figure:: images/scatteroption.png :class: with-shadow :scale: 100 After enabling the toggle, the Scatter will display .. figure:: images/scatterplot.png :class: with-shadow :scale: 100 *Graphstat for Numerical Column--Scatter Plot* The word cloud offers the N-grams feature, which ranges from 1 to 5. .. figure:: images/textoptions.png :class: with-shadow :scale: 100 .. figure:: images/textwc.png :class: with-shadow :scale: 100 *Graphstat for Text variable* Statistical Test ================= A one-sample t-test is used to determine if the mean of a single sample is significantly different from a known or hypothesized population mean. One Sample T Test - One Tailed --------------------------------- A one-sample t-test is used to determine if the mean of a single sample is significantly different from a known or hypothesized population mean. Steps to perform ~~~~~~~~~~~~~~~~~ - Select One Sample T Test - One tailed under test name. - Under Numerical column select Numerical variable. - Under Categorical column select categorical variable. Select if it is required. - Choose a hypothesis mean value based on existing theories, previous research findings, or practical considerations. For example, if previous studies suggest that the average height of adults is 170 cm, you might set your hypothesis mean value to 170 cm. - The default Alpha (Level of significance) value is 0.05. Change the value based on your requirement. .. figure:: images/onetailed.png :class: with-shadow :scale: 100 **Sampling:** - Enable Sampling, if required. Choose the sampling method Random (each member of the population is selected purely by chance. This means that every individual or item has an equal chance to be included in the sample, and selection is not biased by any personal judgment or preference.)or Stratified (method of sampling in which the population is divided into subgroups, or strata, based on certain characteristics that are relevant to the research question. Then, samples are randomly selected from each stratum independently. This approach ensures that each subgroup is represented in the sample, and it allows for more precise estimates and comparisons within each stratum.) **Size:** .. figure:: images/size.png :class: with-shadow :scale: 100 - **% of a Variable:** Selecting a sample size that is a certain percentage of a particular variable. For instance, if the variable is the total population, and you choose 10%, your sample will include 10% of the total population. This approach is useful when you want the sample size to be proportional to the size of a certain variable. .. figure:: images/sizeinputs.png :class: with-shadow :scale: 100 - **Absolute Value:** Selecting a sample size by specifying an absolute number. For example, if you decide that your sample size should be 5 individuals, you simply select 5 individuals from the population. This method is straightforward and is often used when the required sample size is predetermined by external factors such as budget or time constraints. .. figure:: images/absinputs.png :class: with-shadow :scale: 100 - **By Calculator:** Using a calculator involves employing a statistical tool or software to determine the appropriate sample size based on certain parameters such as confidence level, margin of error, and population size. This method ensures that the sample size is statistically valid and sufficient to make reliable inferences about the population. .. figure:: images/bycalculatorinputs.png :class: with-shadow :scale: 100 - **Lazy Sampling:** A less rigorous approach where the sample is selected based on convenience or ease of access rather than using a strict randomization process. This method is quick and cost-effective but may introduce bias and is not always representative of the population. It’s often used in exploratory research or when resources are limited. .. figure:: images/lazysamplinginputs.png :class: with-shadow :scale: 100 - **Systematic Sampling:** Selecting every nth item from a list or population. For example, if you have a list of 1,000 individuals and you want a sample of 100, you would select every 10th individual. This method is easy to implement and ensures a spread across the population, but it assumes that the list is randomly ordered to avoid periodicity bias. .. figure:: images/ssinputs.png :class: with-shadow :scale: 100 - **Filtering:** Enable Filtering and filter the data, if required. Once the steps are done, Click on 'Test' and verify the result. .. figure:: images/resultos.png :class: with-shadow :scale: 100 *Output* Histogram-Density ~~~~~~~~~~~~~~~~~~ Provides a normalized representation of the distribution of data, making it easier to compare distributions and understand the relative proportions of data within different intervals. .. figure:: images/histogram.png :class: with-shadow :scale: 100 **Histogram:** - The histogram displays the frequency distribution of age data within specified intervals (bins). - The blue bars represent the number of data points (frequency) that fall within each age interval. - The height of each bar corresponds to the count of data points in that interval. **Density:** - The black curve represents the density estimate, showing the distribution of the data in a continuous manner. - The area under the density curve represents the entire data set and sums to 1. The histogram gives a direct count of data points within intervals, while the density plot provides a smoothed view, highlighting the shape of the distribution. Together, they offer a comprehensive view of the data, with the histogram showing discrete counts and the density plot illustrating the overall distribution pattern. T distribution ~~~~~~~~~~~~~~~~ Provides a visual representation of how the shape of the distribution changes with different sample sizes and illustrates the properties of the t-distribution relevant for statistical inference. .. figure:: images/t-distribution.png :class: with-shadow :scale: 100 **Central Tendency of Age Data:** - The peak of the density curve (red line) suggests that the most common or average age in the dataset is around 0 on a standardized scale. This means that the data is centered around this value. **Spread and Distribution of Age Data:** - The shape of the density curve (red line) and the t-distribution curve (blue line) indicates how age values are distributed in the dataset. - The t-distribution curve (blue line) closely follows the density curve, but it's slightly smoother and broader. This suggests that the data may have been fitted to a theoretical distribution, such as the t-distribution, often used in statistics for small sample sizes. **Significance of Tail Area:** - The shaded area under the t-distribution curve (pink) represents the tail of the distribution. - This area is significant in statistical terms, especially for hypothesis testing, where it helps determine the probability of observing values beyond a certain threshold (e.g., ages above a certain point). Data Observation ~~~~~~~~~~~~~~~~~ The observation of the data is also displayed. .. figure:: images/obs.png :class: with-shadow :scale: 100 Hypothesis ~~~~~~~~~~ Here the detailed hypothesis displayed. .. figure:: images/hypothesis.png :class: with-shadow :scale: 100 Summary ~~~~~~~ Here the Summary Stats are displayed. .. figure:: images/summarystats.png :class: with-shadow :scale: 100 One Sample T Test - Two Tailed ----------------------------------- In a one-sample t-test with a two-tailed hypothesis, one examines whether the sample mean significantly differs from a hypothesized population mean. This involves calculating the t-statistic based on the sample data and comparing it to critical values from a t-distribution. The aim is to determine if the mean significantly deviates from the hypothesized value in either direction, hence the term "two-tailed". Steps to perform ~~~~~~~~~~~~~~~~~~ - Select One Sample T Test - One tailed under test name. - Under Numerical column select Numerical variable. - Under Categorical column select categorical variable. Select if it is required. - Choose a hypothesis mean value based on existing theories, previous research findings, or practical considerations. For example, if previous studies suggest that the average height of adults is 170 cm, you might set your hypothesis mean value to 170 cm. - The default Alpha (Level of significance) value is 0.05. Change the value based on your requirement. - To learn more: 'Sampling' - To learn more: 'Filtering' Histogram-Density ~~~~~~~~~~~~~~~~~~ To learn more: 'Histogram-Density' T Distribution ~~~~~~~~~~~~~~~ To learn more: 'T distribution' To learn more: 'Data Observation' Shapiro-Wilk --------------- The Shapiro-Wilk test is a useful tool for determining the normality of a data set, which is a crucial assumption in many statistical analyses. It provides a clear decision-making process based on the p-value derived from the T-statistic. Steps to perform ~~~~~~~~~~~~~~~~~ - Select Shapiro-Wilk test under test name. - Select Numerical variable under Numerical clumn. To learn more: 'Sampling' To learn more: 'Filtering' Histogram-Density ~~~~~~~~~~~~~~~~~ To learn more: 'Histogram-Density' .. figure:: images/shapirooutput.png :class: with-shadow :scale: 100 *Output* Data Observation ~~~~~~~~~~~~~~~~~ The observation of the selected variable is displayed. Unlike One Sample T test - One Tailed & Two Tailed, this won't display Standard error. .. figure:: images/shapiroobs.png :class: with-shadow :scale: 100 To learn more: 'T distribution' Hypothesis ~~~~~~~~~~ Under Hypothesis, this will give whether the sample looks normally distributed or not. .. figure:: images/hypo.png :class: with-shadow :scale: 100 Summary ~~~~~~~~ Under Summary, only the P Value and T-Statistics are displayed. .. figure:: images/sum.png :class: with-shadow :scale: 100 Paired Student T-Test ---------------------- The paired Student's t-test compares the means of two related groups to determine if there is a statistically significant difference. It is used when the same subjects are measured before and after a treatment or in matched pairs. The test calculates the difference between paired observations, then computes the mean and standard deviation of these differences to determine the t-statistic. If the p-value is less than a chosen significance level (e.g., 0.05), the null hypothesis (no difference) is rejected, indicating a significant difference between the means. This won't display the Histogram-Density Plot and T-distribution. Steps to perform ~~~~~~~~~~~~~~~~~~ - Select Paired Student T test under test name. - Select two Numerical variables under Numerical columns. - To learn more: 'Sampling' - To learn more: 'Filtering' Data Observation ~~~~~~~~~~~~~~~~~ The observation of the selected variable is displayed. Unlike One Sample T test - One Tailed & Two Tailed, this won't display Standard error. .. figure:: images/pairedobs.png :class: with-shadow :scale: 100 Hypothesis ~~~~~~~~~~~ Under Hypothesis, will display only whether the two pairs are significant or not. .. figure:: images/pairedhypo.png :class: with-shadow :scale: 100 Summary Stats ~~~~~~~~~~~~~~ Under Summary, the summary statistics are displayed. .. figure:: images/pairedsum.png :class: with-shadow :scale: 100 Chi Square Test --------------- The Chi Square test is a statistical method used to determine whether there is a significant association between categorical variables. It compares the observed frequencies of categories with the expected frequencies if there were no association. This won't display the Histogram-Density Plot and T-distribution Steps to perform ~~~~~~~~~~~~~~~~ - Select Chi Square test under test name. - Select two Categorical variables under Categorical columns. - To learn more: 'Sampling' - To learn more: 'Filtering' Data Observation ~~~~~~~~~~~~~~~~~ The observation of the selected variable is displayed. Unlike One Sample T test - One Tailed & Two Tailed, this won't display Standard error. .. figure:: images/chiobs.png :class: with-shadow :scale: 100 Hypothesis ~~~~~~~~~~~~ Under Hypothesis, will display only whether the two pairs are significant or not. .. figure:: images/chihypo.png :class: with-shadow :scale: 100 Summary Stats ~~~~~~~~~~~~~~ Under Summary, the summary statistics are displayed. .. figure:: images/chisum.png :class: with-shadow :scale: 100 .. note:: In order to view the T distribution, click on 'T Distribution' next to 'Histogram-Density'. In order to view the Summary Stats click on the 'Summary' next to 'Hypothesis' in the 'Hypothesis & Summary' Panel. Transformations ======================= Absolute --------------------- **For video tutorial see:** |location_link1|. .. |location_link1| raw:: html SEDGE YouTube channel - A functions abs( ): Absolute value of a variable. The magnitude of a real number without regard to its sign. Numbers with negative values will be converted to positive ones. Syntax: .. code-block:: python abs(#column1#) AS <> New column name: Newabscolumn .. note:: The absolute value can only be applied to Numerical datatype columns. The new column name should not contain any special characters such as !@#$%^&*()-+= space If there are any missing rows in the numerical column, then the row will be ignored. Example: .. code-block:: python abs(#temperature#) Where (#temperature#) include following values -12, 6, 0.5, -6, ,25 After applying abs(#temperature#) the values will change to 12, 6, 0.5, 6, ,25 The column may have temperatures in -ve value and this will convert all negative temperatures to positive | ============================== Accrint ------------------------------- accrint(): The ACCRINT function returns the accrued interest for a security that pays interest periodically. The arguments can be explained as: * Issue - The issue date of the security * First_interest - The first date interest will be paid * Settlement - The settlement date of the security * Rate - The security's annual coupon rate. * Par - The security's par value. * Frequency - The number of interest payments per year. Choose from the following: * 1 = annual payments * 2 = semi -annual payments * 4 = quarterly payments * 12 = monthly payments Syntax: .. code-block:: python accrint(#Issue_Date#, #First_Interest_Date#, #Settlement_Date#, #Annual_Interest_Rate#, #Par#, #Frequency#) as <> New column name: Newcol_accrint .. note:: Issue_date, First_Interest_Date and Settlement_Date must be in date datatype. Example: accrint(#issue_Date#, #First_Interest#, #Settlement_Data#, #open_acc#, #Par#, 2) as <> Above example will return the expected result. | ============================== Arc cosine - Acos ------------------------------- acos( ): Arc cosine function result is the same as the inverse of cosine of a variable, the function returns angles in radians Syntax: .. code-block:: python acos(#column1#) AS <> New column name: Newacoscolumn .. note:: The acos function can only be applied to Numerical datatype columns The new column name should not contain any special characters such as !@#$%^&*()-+= space If there are any missing rows in the numerical column, then the row will be ignored Example: .. code-block:: python acos(#NumericalColumn#) AS <> Where (#numericalColumn#) value is between -1 and 1 After applying acos, we will get an angle expressed in radians | ============================== Add months ------------------------------- Add months( ): The function returns a date with given number of months added (date + integer months) Syntax: .. code-block:: python add_months(#DateColumn#, number_of_months) as <> New column name: NewAddColumn .. note:: The Add months function can only be applied to Date datatype columns Example: .. code-block:: python add_months(#DateOfHire#, 12) as <> Assuming new employees have a 12 month trial period, after adding 12 months to the Date of hire column we will get the date when the trial period terminates | ============================== .. _append-trans: Append ------------------------------- append( ): Appending a variable with a pattern to create a new variable. Syntax: .. code-block:: python append(#column1#, 'Pattern to be appended') AS <> New column name: NewColumn .. note:: Both Categorical and Numerical value and the new field which is created is a Categorical type. Example: .. code-block:: python append(#temperature#, 'Deg C') AS <> Assuming that the temperature column is a numeric value and appending the string 'Deg C' will append the text to the numerical value. Example '13.5' is the numerical value and the text 'Deg C' will be appended to read '13.5 Deg C', and the new column AppCol1 which is formed will be string. | ============================== Ascii ------------------------------- ascii( ): Function returns numerical value of the first character of a string Syntax: .. code-block:: python ascii(#Column#) as <> New column name: NewAsciiColumn .. note:: Works for categorical columns Example: .. code-block:: python ascii(#CategoricalColumn#) AS <> We have a categorical column "Geography" and we want to transform the variables into numerical Ascii representation. We have variable "France", "Spain" and "Germany" and after the transformation we get results of "70", "83" and "71" as per the ASCII characters table. | ============================== Arcsin ------------------------------- asin( ): The arcsin function is the inverse of the sine function. It returns the angle whose sine is a given number. Use arcsin when you know the sine of an angle and want to know the actual angle Syntax: .. code-block:: python asin(#Column#) as <> New column name: NewAsinColumn .. note:: The asin function can only be applied to Numerical datatype columns The new column name should not contain any special characters such as !@#$%^&*()-+= space If there are any missing rows in the numerical column, then the row will be ignored Example: .. code-block:: python asin(#SineColumn#) as <> Assuming we have a column with sine value (0 to 1) we can determine the Asin value. In case our Sine value is 0,85 we will get an Asin value of 1.0159852938148251. | ============================== Atan - Arc tangent ------------------------------- atan( ): The Atan function returns the arc tangent in radians (which is the inverse tangent), of its argument. The arctangent means the angle whose tangent is the argument. Syntax: .. code-block:: python atan(#Column#) as <> New column name: NewAtanColumn .. note:: The atan function can only be applied to Numerical datatype columns The new column name should not contain any special characters such as !@#$%^&*()-+= space If there are any missing rows in the numerical column, then the row will be ignored Example: .. code-block:: python atan(#NumericColumn#) as <> Assuming we have a column with numerical value and a row has value of 37, the Atan function will return the arc tangent value of 1.5437758 - angle in radians. | ============================== Atan2 ------------------------------- atan2( ): The atan2 function calculates one unique arc tangent value from two variables x and y and returns angle in radians Syntax: .. code-block:: python atan2(#Column_Y#, #Column_X#) as <> New column name: NewAtan2Column .. note:: The atan2 function can only be applied to Numerical datatype columns The new column name should not contain any special characters such as !@#$%^&*()-+= space If there are any missing rows in the numerical column, then the row will be ignored Example: .. code-block:: python atan2(#NumericColumnX#, #NumericColumnY#) as <> Assuming we have 2 columns with numerical value and a row has value of x=44 and y=10 the Atan2 function will return the value of 1.3473197256542635 | ============================== Base64 ------------------------------- base64( ): The base64 function is most commonly used to encode binary data and transform the argument from binary format to a base 64 string Syntax: .. code-block:: python base64(#Column#) as <> New column name: NewBaseColumn Example: .. code-block:: python base(#CategoricalColumn#) as <> Assuming we have a categorical column with locations and one variable is France, the Base64 function will transform the value to RnJhbmNI as a base 64 string | ============================== Bigint ------------------------------- Bigint( ): The bigint function returns a 64-bit integer representation of a number or character string in the form of an integer constant Syntax: .. code-block:: python bigint(#Column#) as <> New column name: NewBigintColumn | ============================== Bin ------------------------------- Bin( ): The bin function takes in integer x and returns the binary representation of x in a string format Syntax: .. code-block:: python bin(#Column#) as <> New column name: NewBinColumn Example: .. code-block:: python bin(#NumericColumn#) as <> Assuming we have a numerical column with a value of 15, the Bin function will return its binary value of 1111 | ============================== Bit-length ------------------------------- Bit-length( ): The function returns the length of the given string in bits Syntax: .. code-block:: python bit_length(#Column#) as <> New column name: NewBitLColumn .. note:: The function can only be applied to Categorical datatype columns Example: .. code-block:: python bit_length(#Geography#) as <> Assuming we have a column with France as a variable, the function will return a value of 48 as a length of the string in bits | ============================== Bround ------------------------------- Bround( ): Returns expression rounded to n decimal places using HALF_EVEN rounding mode Syntax: .. code-block:: python bround(#Column#, n) as <> New column name: NewBroundColumn Example: .. code-block:: python bround(#NumericColumn#, 3) as <> Assuming we have a numerical column with a value of 51.93487, the Bround function will return its rounded value of 51.935 | ============================== Char ------------------------------- Char( ): The char function returns the character based on the ASCII code Syntax: .. code-block:: python char(#Column#) as <> New column name: NewCharColumn Example: .. code-block:: python char(#NumericColumn#) as <> Assuming we have a numerical column with a value of ASCII = 77, the Char function will return the character M | ============================== Char_length ------------------------------- Char_length( ): Gives the length in characters of the input string Syntax: .. code-block:: python char_length(#Column#) as <> New column name: NewCharLColumn .. note:: The function can only be applied to Categorical datatype columns Example: .. code-block:: python char_length(#CategoricalColumn#) as <> Assuming we have a categorical column "Geography" then the variable "Spain" will return 5 as a length of characters, "Germany" will return 7 etc. | ============================== Coalesce ------------------------------- Coalesce( ): The function returns the first non-null expression in the list. If all expressions evaluate to null, then the Coalesce function returns null Syntax: .. code-block:: python coalesce(#Column1#, #Column2#, ..., #ColumnN#) as <> New column name: NewCoalColumn .. note:: The function can be applied to Categorical and Numerical datatype columns Example: .. code-block:: python coalesce(#FinishDate#, #StartDate#) as <> Assuming we have a column "StartDate" and "FinishDate". Some of the variables in "FinishDate" are not filled in as a task was not completed yet. In this case the transformation will be searching for "FinishDate" firstly (written on first place in the syntax), if the value is null it will return the "StartDate" value. | ============================== Concat ------------------------------- Concat( ): The function is used to concatenate two strings to make a single string, for example concatenate two or more columns in one Syntax: .. code-block:: python concat(#Column1#, #Column2#, ..., #ColumnN#) as <> New column name: NewConcColumn Example: .. code-block:: python concat(#Sex#, #Age#) as <> Assuming we have a column "Sex" and "Age" and we want both of these variables in one single column. After applying the concat function, we will get results like "Male24", "Female43" etc. | ============================== Concat_ws ------------------------------- Concat_ws( ): The function joins the input strings into a single string. It separates those concatenated strings with the separator specified in the first argument Syntax: .. code-block:: python concat_ws('separator', #Column1#, #Column2#, ..., #ColumnN#) as <> New column name: NewConcWSColumn .. note:: The CONCAT_WS() requires at least two input strings Example: .. code-block:: python concat_ws('/', #Sex#, #Age#) as <> Assuming we have a column "Sex" and "Age" and we want both of these variables in one single column separated by "/". After applying the concat function, we will get results like "Male/24", "Female/43" etc. | ============================== Conv ------------------------------- Conv( ): The Conv function converts a number from one numeric base system to another, and returns the result as a string value Syntax: .. code-block:: python conv(#Column#, from_base, to_base) as <> New column name: NewConvColumn Example: .. code-block:: python conv(#Column1#, 11, 3) Convert a column from numeric base system 11 to numeric base system 3 .. note:: This function returns NULL if any of the parameters are NULL | ============================== Cos ------------------------------- Cos( ): The Cos function returns the cosine of a number Syntax: .. code-block:: python cos(#Column#) as <> New column name: NewCosColumn .. note:: The Cos function can only be applied to Numerical datatype columns The new column name should not contain any special characters such as !@#$%^&*()-+= space If there are any missing rows in the numerical column, then the row will be ignored Example: .. code-block:: python cos(#NumericalColumn#) as <> Assuming we have a numerical column with angles and we want to convert it into cosine value we use this function. A value of 2 will be converted into -0,416147. | ============================== Cosh ------------------------------- Cosh( ): The Cosh function returns the hyperbolic cosine of a number Syntax: .. code-block:: python cosh(#Column#) as <> New column name: NewCoshColumn .. note:: The Cosh function can only be applied to Numerical datatype columns The new column name should not contain any special characters such as !@#$%^&*()-+= space If there are any missing rows in the numerical column, then the row will be ignored Example: .. code-block:: python cosh(#NumericalColumn#) as <> Assuming we have a numerical column with angles and we want to convert it into hyperbolic cosine value we use this function. A value of 3 will be converted into 10,067. | ============================== Cot ------------------------------- Cot( ): The Cot function returns the cotangent of a number Syntax: .. code-block:: python cot(#Column#) as <> New column name: NewCotColumn .. note:: The Cos function can only be applied to Numerical datatype columns The new column name should not contain any special characters such as !@#$%^&*()-+= space If there are any missing rows in the numerical column, then the row will be ignored Example: .. code-block:: python cot(#NumericalColumn#) as <> Assuming we have a numerical column with angles and we want to convert it into cotangent value we use this function. A value of 6 will be converted into -3.436353. | ============================== Crc32 ------------------------------- Crc32( ): The Crc32 function function calculates a 32-bit CRC (cyclic redundancy checksum) for a string. This function can be used to validate data integrity Syntax: .. code-block:: python crc32(#Column#) as <> New column name: NewCrcColumn .. note:: The Crc32 function can only be applied to Categorical datatype columns Example: .. code-block:: python crc32(#CategoricalColumn#) as <> Assuming we have a categorical column and we want to convert it into 32-bit CRC value we use this function. A string word "France" will be converted into 1493835906. | ============================== Cube Root ------------------------------- **For video tutorial see:** |location_link2|. .. |location_link2| raw:: html SEDGE YouTube channel - C functions cbrt() : Finds the cube root of the value Syntax: .. code-block:: python cbrt(#MPG#) as <> New column name: Newcolcbrt .. note:: * A The cbrt value can only be applied to Numerical datatype columns. * B The new column name should not contain any special characters such as !@#$%^&*()-+= space * C If there are any missing rows in the numerical column, then the row will be ignored. Example: .. code-block:: python cbrt(#NewBornWeight#) AS <> Where (#NewWeight#) include following values 729, 27, 5832 After applying cbrt(#NewWeight#) the values will change to 9, 3, 18 | ============================== Ceiling ------------------------------- ceil() : Ceiling value of a number returns the greatest integer, greater than or equal to the number. The ceil of number 2.5 will generate the value 3. The ceil of the value -2.5 will generate the value -2. Syntax: .. code-block:: python ceil(#column1#) AS <> New column name: Newceilcolumn .. note:: The ceil value can only be applied to Numerical datatype columns. The new column name should not contain any special characters such as !@#$%^&*()-+= space If there are any missing rows in the numerical column, then the row will be ignored. Example: .. code-block:: python ceil(#NewBornWeight#) AS <> Where (#NewBornWeight #) include following values 2.2, 3.5, 2.8, , 2.7 After applying ceil(#NewBornWeight #) the values will change to 3, 4, 3, , 3 | ============================== .. _copy-trans: Copy ------------------------------- Copy( ): Copying a variable and naming it as a new variable Syntax: .. code-block:: python copy(#column1#) AS <> New column name: NewCopiedColumn .. note:: Both Categorical and Numerical value and the new field which is created are of the same datatype as the copied column. Example: .. code-block:: python copy(#Age#) as <> Column 'Age' is copied as new field 'new_age' | ============================== Current_date ------------------------------- Current_date( ): The CURRENT_DATE() function returns the current date at the time of query execution Syntax: .. code-block:: python current_date()as <> New column name: NewCDColumn .. note:: The date is returned as "YYYY-MM-DD" (string) or as YYYYMMDD (numeric) Example: .. code-block:: python current_date()as <> A new column with today's date is created - could be used to calculate a time difference between any other date column and current date. | ============================== Datediff ------------------------------- datediff( ): Returns the number of days between two date columns Syntax: .. code-block:: python datediff(#column1#, #column2#) as <> New column name: Newcol_diff .. note:: * The datediff function is only applied for date_columns * #column1#, #column2# must be a valid date_column Example: .. code-block:: python datediff(#DOB#, #Date of Hire#) as <> Above example returns the number of days between two date columns | ============================== Datediff_hrs ------------------------------- datediff_hrs( ): Returns the number of hours between two date columns Syntax: .. code-block:: python datediff_hrs(#column1#, #column2#) as <> New column name: Newcol_diffhrs .. note:: * The datediff_hrs function is only applied for date_columns * #column1#, #column2# must be a valid date_columns Example: .. code-block:: python datediff_hrs(#DOB#, #Date of Hire#) as <> Above example returns the number of hours between two date columns | ============================== Datediff_mins ------------------------------- datediff_mins( ): Returns the number of minutes between two date columns Syntax: .. code-block:: python datediff_mins(#column1#, #column2#) as <> New column name: Newcol_diffmins .. note:: * The datediff_mins function is only applied for date_columns * #column1#, #column2# must be a valid date_columns Example: .. code-block:: python datediff_mins(#DOB#, #Date of Hire#) as <> Above example returns the number of minutes between two date columns | ============================== Datediff_secs ------------------------------- datediff_secs( ): Returns the number of seconds between two date columns Syntax: .. code-block:: python datediff_secs(#column1#, #column2#) as <> New column name: Newcol_diffsecs .. note:: * The datediff_secs function is only applied for date_columns * #column1#, #column2# must be a valid date_columns Example: .. code-block:: python datediff_secs(#DOB#, #Date of Hire#) as <> Above example returns the number of seconds between two date columns | ============================== Date_add ------------------------------- date_add(): add days to the selected date Syntax: .. code-block:: python date_add(#Column#, Number_of_days_to_add) as <> New column name: Newcol_dateadd .. note:: * The date_add function is only applied for date_columns * • Number_of_days_to_add field should have digits Example: .. code-block:: python date_add(#DOB#,3) as <> Above example adds days to the selected date | ============================== Date_value ------------------------------- date_value( ): converts date to serial number Syntax: .. code-block:: python date_value(#column#, 'offset_date') as <> New column name: Newcol_value .. note:: * The date_value function is only applied for date_columns * Offset_date must be a valid date format(yyyy-MM-dd) Example: .. code-block:: python date_value(#DOB#, '1900-01-01') as <> Above example shows the conversion of date to serial number | ============================== Current_timestamp ------------------------------- Current_timestamp( ): The CURRENT_TIMESTAMP() function returns the current date and time Syntax: .. code-block:: python current_timestamp() as <> New column name: NewCTColumn .. note:: The date and time is returned as "YYYY-MM-DD HH-MM-SS" (string) or as YYYYMMDDHHMMSS.uuuuuu (numeric) Example: .. code-block:: python current_timestamp() as <> A new column with today's date and time is created - could be used to calculate a time difference between any other date column and current date. | ============================== Date_isin ------------------------------- Date_isin ( ): The function is used to check if a date column contains a value from selected date interval or not Syntax: .. code-block:: python date_isin(#DateColumn#, 'start_date', 'end_date') as <> New column name: NewDIIColumn .. note:: * The date_isin function is only applied for date_column * date1 and date2 refers the input_date that must have valid date format(yyyy-MM-dd) Example: .. code-block:: python date_isin(#DOB#, '2009-05-24', '2020-01-24') as <> Above example shows the given date column lies in between given dates (date1 and date2) | ============================== Date_sub ------------------------------- Date_sub ( ): The function returns the date after subtracting a selected number of days from the original date in the column Syntax: .. code-block:: python date_sub(#DateColumn#, number_of_days) as <> New column name: NewDateSubColumn Example: .. code-block:: python date_sub(#DateOfHire#, 7) as <> Above example shows the given date column with 7 days subtracted from each variable | ============================== Date_trunc ------------------------------- Date_trunc ( ): The function returns date with the time portion of the day truncated to the unit specified by the format model Format should be one of ["YEAR", "YYYY", "YY", "MON", "MONTH", "MM", "DAY", "DD", "HOUR", "MINUTE", "SECOND", "MILLISECOND", "MICROSCOND", "WEEK", "QUARTER"] Syntax: .. code-block:: python date_trunc('format', #DateColumn#) as <> New column name: NewTruncColumn | ============================== Day ------------------------------- Day ( ): The Day function returns the day of the month for a given date (a number from 1 to 31) Syntax: .. code-block:: python day(#DateColumn#) as <> New column name: NewDayColumn Example: .. code-block:: python day(#DateOfHire#) as <> Above example will extract each day from the respective column - for example '2020-09-27' will return '27' as a result | ============================== Dayofmonth ------------------------------- Dayofmonth ( ): The function returns the day of the month for a given date (a number from 1 to 31) Syntax: .. code-block:: python dayofmonth(#DateColumn#) as <> New column name: NewDOMColumn Example: .. code-block:: python dayofmonth(#DateOfHire#) as <> Above example will extract each day from the respective column - for example '2020-09-27' will return '27' as a result | ============================== Dayofweek ------------------------------- Dayofweek ( ): The function returns the weekday index for a given date (a number from 1 to 7) Syntax: .. code-block:: python dayofweek(#DateColumn#) as <> New column name: NewDOWColumn .. note:: 1=Sunday, 2=Monday, 3=Tuesday, 4=Wednesday, 5=Thursday, 6=Friday, 7=Saturday Input date column must be in format 'yyyy-MM-dd' Example: .. code-block:: python dayofweek(#DateOfHire#) as <> Above example will extract the numer of day in a week from the respective column - for example '2021-03-27' will return '7' for Saturday as a result | ============================== Dayofyear ------------------------------- Dayofyear ( ): The function returns the day of the year for a given date (a number from 1 to 366) Syntax: .. code-block:: python dayofyear(#DateColumn#) as <> New column name: NewDOYColumn .. note:: Input date column must be in format 'yyyy-MM-dd' Example: .. code-block:: python dayofyear(#DateOfHire#) as <> Above example will extract the numer of day in a year from the respective column - for example '2021-02-27' will return '58' as a result | ============================== Degrees ------------------------------- Degrees ( ): The function converts a value in radians to degrees Syntax: .. code-block:: python degrees(#Column#) as <> New column name: NewDegrColumn Example: .. code-block:: python degrees(#RadianColumn#) as <> Above example will convert a value in radians to degrees - for example a radian value 2 will give result of 114.592 in degrees. | ============================== Delete ------------------------------- **For video tutorial see:** |location_link3|. .. |location_link3| raw:: html SEDGE YouTube channel - D functions delete( ): Deleting string character from the 2nd to 5th position in the Column Syntax: .. code-block:: python delete(#column1#, 2, 5) AS <> New column name: NewColumn .. note:: Applies to Categorical data type and the new column which is created is also Categorical type. Example: .. code-block:: python delete(#column1#, 1, 6) AS <> The characters in column 1 in the position from 1st Character to the 6th Character is deleted. If the data in column 1 is “Hello World”, “Great Day”, after applying the formula, the new field will have the values “World” and “Day”. | ============================== Double ------------------------------- Double ( ): The function casts the value to the target data type Syntax: .. code-block:: python double(#Column#) as <> New column name: NewDoubColumn | ============================== Effect ------------------------------- effect( ): The EFFECT Function returns the effective annual interest rate, from the nominal annual interest rate (APR), and the number of compounding periods per year. The arguments can be explained as: * Nominal - Annual interest rate, also known as APR (Annual Percentage Rate) * Npery - Number of periods that interest is calculated Syntax: .. code-block:: python EFFECT(#nominal_rate#, #npery#) AS <> New column name: Newcol_effect Example: .. code-block:: python effect (#int_rate#, #open_acc#) as <> Above example will return the expected result. | ============================== Exponential ------------------------------- **For video tutorial see:** |location_link4|. .. |location_link4| raw:: html SEDGE YouTube channel - E functions exp( ): The exp() function allows users to calculate the exponential value with the base set to e. e is a mathematical constant, with a value approximately equal to 2.71828 and is also called Eulers constant. Syntax: .. code-block:: python exp(#column1#) AS <> New column name: NewColumn .. note:: * A - e is a Mathematical constant, with a value approximately equal to 2.71828 * B - Applies to Numerical data type and the new column which is created is also Numerical type. Example: .. code-block:: python exp(#column1#) AS <> Assuming the numerical value in the #column1# is 2 then exp(2) means 2.7182 to the power of 2 | ============================== Expm1 ------------------------------- Expm1 ( ): The expm1 function (e ^ x )- 1 returns e raised to the given power argument, minus 1.0, where e is the base of the natural logarithms (e is the Eulers number equal to 2.71828) Syntax: .. code-block:: python expm1(#Column#) as <> New column name: NewExpm1Column Example: .. code-block:: python expm1(#NumericalColumn#) AS <> Assuming the numerical value in the #column# is 3 then expm1 result will be 19.08554 | ============================== Factorial ------------------------------- **For video tutorial see:** |location_link5|. .. |location_link5| raw:: html SEDGE YouTube channel - F functions factorial( ): Factorial of an integer number. Syntax: .. code-block:: python factorial(#column1#) AS <> New column name: NewColumn .. note:: * A - Applies to Numerical data type and the new column which is created is also Numerical type. * B - If the value is non-integer, then the factorial will change the number to integer value and then apply factorial on integer. Example: .. code-block:: python factorial(#column1#) AS <> Assuming the numerical value in the #column1# is 5 then factorial(5) means 5 x 4 x 3 x 2 x 1 = 120 If the value in the column is non-integer then factorial(4.3) means 4 x 3 x 2 x 1 = 24 | ============================== .. _find-trans: Find ------------------------------- find( ): Finds the full character within the field, and returns true if it finds character, and false if not. Syntax: .. code-block:: mysql find(#marital_status#,'Married') AS <> The first word is the field name within #. Example #marital_status# The second is the text / word within double quotes that needs to be found. Example 'Married' New column name: NewColumn .. note:: * A Applies to Categorical data type and the new column which is created is Categorical type and the output is True or False. * B If the word is partly existing, it will not return True. * C The word being found is case sensitive. Example: .. code-block:: python find(#marital_status#,'Married') AS <> The word 'Married' if found in the field, it return True, else False. | ============================== Float ------------------------------- Float ( ): The Float function converts a specified value into a floating point number. Syntax: .. code-block:: python float(#Column#) as <> New column name: NewFloatColumn | ============================== Floor ------------------------------- floor( ) : Floor value of a number returns the largest integer value less than or equal to the specified number. The floor of number 2.5 will generate the value 2. The floor of the value -2.5 will generate the value -3. Syntax: .. code-block:: python floor(#column1#) AS <> New column name: Newfloorcolumn .. note:: The floor value can only be applied to Numerical datatype columns. The new column name should not contain any special characters such as !@#$%^&*()-+= space If there are any missing rows in the numerical column, then the row will be ignored. Example: .. code-block:: python floor(#NewBornWeight#) AS <> Where (#NewBornWeight #) include following values 2.2, 3.5, 2.8, ,2.7 After applying floor(#NewBornWeight #) the values will change to 3, 4, 3, , 3 | ============================== Format_number ------------------------------- Format_number ( ): The Format number function formats a number to a format like "#,###,###.##", rounded to a specified number of decimal places, then it returns the result as a string Syntax: .. code-block:: python format_number(#Column#, number of digits to be rounded to) as <> New column name: NewNumForColumn Example: .. code-block:: python format_number(#Distance#, 2) as <> Where distance is 153.4625 the result string will be 153.46. | ============================== From unixtime ------------------------------- From_unixtime ( ): This function converts the UNIX timestamp to default Datetime format (yyyy-MM-dd HH:mm:ss) Syntax: .. code-block:: python from_unixtime(#Column#) as <> New column name: NewFrUnColumn Example: .. code-block:: python from_unixtime(#DateColumn#) as <> Where UNIX time is 1255033470 the result converted time will be 2009-10-08 13:24:30 | ============================== From-utc-timestamp ------------------------------- From_utc_timestamp ( ): The function returns the current UTC date and time as a value in 'YYYY-MM-DD HH:MM:SS' or YYYYMMDDHHMMSS.uuuuuu format depending on the usage of the function i.e. in a string or numeric context. Syntax: .. code-block:: python from_utc_timestamp(#Datecolumn#, 'timezone') as <> **For the list of time zones click on the link below:** |location_link29|. .. |location_link29| raw:: html List of time zones New column name: NewTimeColumn Example: .. code-block:: python from_utc_timestamp(#Datecolumn#, 'Europe/Istanbul') as <> Where the selected date is 2016-08-31 and the time zone is Tokyo the result converted time will be 2016-08-31 03:00:00 | ============================== FV ------------------------------- fv( ): The FV function is a financial function that returns the future value of an investment. We can use the FV function to get the future value of an investment assuming periodic, constant payments with a constant interest rate. * Rate - Interest rate per compound period * N - The total number of payment periods * Pmt - The payment per period * Pv - The present value of the investment * Type - When payments are deposited #. end = End of each period #. begin = Beginning of each period Syntax: .. code-block:: python fv(#Annual_Interest_Rate#, #Number_of_Periods#, #Amount_of_Payments#, #Present_value#, 'end') as <> New column name: : Newcol_fv .. code-block:: python fv(#int_rate#, #open_acc#, #Par#, #pv#) as <> Above example will return the expected result. | ============================== Geo Dist Kms ------------------------------- Geo_dist_kms ( ): The function calculates the distance in kilometres between two geographical points Syntax: .. code-block:: python geo_dist_kms(#Lati_A#, #Lati_B#, #Longi_A#, #Longi_B#) as <> New column name: NewGDKColumn **For the example chart click on the link below:** |location_link7|. .. |location_link7| raw:: html Geo lat long chart Example: .. code-block:: python geo_dist_kms(#Lati_A#, #Lati_B#, #Longi_A#, #Longi_B#) as <> Where the selected Latitude A is 38.631913, Longitude A is -121.434879, Latitude B is 38.502519 and Longitude B is -121.420769 the calculated distance is 14.44016 Kms | ============================== Greatest ------------------------------- Greatest ( ): The function returns the greatest value from the list of arguments Syntax: .. code-block:: python greatest(#Column1#, #Column2#, ..., #ColumnN#) as <> New column name: NewGreatColumn Example: .. code-block:: python greatest(#Temperature1#, #Temperature2#, ..., #TemperatureN#) as <> Where the selected temperatures are 10, 4 and 35 degrees the function will select 35 value as the greatest one | ============================== Hexadecimal ------------------------------- **For video tutorial see:** |location_link8|. .. |location_link8| raw:: html SEDGE YouTube channel - H functions hex( ): Converts decimal numbers (base 10) to hexa decimal value (base 16). As a base-16 numeral system, it uses 16 symbols. These are the 10 decimal digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) and the first six letters of the English alphabet (A, B, C, D, E, F). The letters are used in order to represent the values 10, 11, 12, 13, 14 and 15 each in one single symbol. Syntax: .. code-block:: python hex(#column1#) AS <> New column name: NewColumn .. note:: Applies to numerical data type and the new column which is created is Categorical type. Example: .. code-block:: python hex(#column1#) AS <> Converts the decimal numeral system to hexa decimal numeral system. .. figure:: images/hexadecimal.png :class: with-shadow :scale: 80 *image Hexa decimal explanation* | ============================== Holiday_count ------------------------------- holiday_count( ): Returns the number of holidays between two given date columns based on country Syntax: .. code-block:: python holiday_count(#column1#,#column2#, 'countrycode') as <> New column name: Newcol_holidaycount **For the list of country codes click on the link below:** |location_link9|. .. |location_link9| raw:: html List of country codes Example: .. code-block:: python holiday_count (#DOB#, #Date of Hire#, 'US) as <> Above example returns the number of holidays between two given date columns based on country | ============================== Hour ------------------------------- hour( ): The function returns the hour part for a given date (from 0 to 838) Syntax: .. code-block:: python hour(#Datecolumn#) as <> New column name: NewHourColumn Example: .. code-block:: python hour(#Datecolumn#) as <> Above example returns the hour number in a specific column a value of '2019-07-28 09:15:00' will return '9' as a result | ============================== Hypot ------------------------------- hypot( ): The function returns the square root of the sum of squares of its arguments - can be used as a calculation of hypotenuse of a triangle Syntax: .. code-block:: python hypot(#Column1#, #Column2#) as <> New column name: NewHypotColumn Example: .. code-block:: python hypot(#TriangleLegA#, #TriangleLegB#) as <> In case the length of one leg is 10 and second leg is 40 the hypotenuse result will be 41.23106. | ============================== If Else And ------------------------------- **For video tutorial see:** |location_link10|. .. |location_link10| raw:: html SEDGE YouTube channel - I functions If Else And: The if/else statement executes a code if a specified condition is true. If the condition is false, another code is executed. Syntax: .. code-block:: python if(#age#>=0 and #age#<=18,'Child',(if(#age#>=19 and #age#<=50,'Adult' ,(if(#age#>=51 and #age#<=70,'Mature','OLD' ))))) as <> New column name: NewAgeColumn2 .. note:: * A Applies to Categorical, Numerical, Text or Date data type and the new column which is created is either Categorical or Numerical type. * B The column name is case sensitive and the existing columns should be defined within #. * C The And operator is also used to join multiple conditions. Example: .. code-block:: python if(#age#>=0 and #age#<=18,'Child',(if(#age#>=19 and #age#<=50,'Adult' ,(if(#age#>=51 and #age#<=70,'Mature','OLD' ))))) as <> In the above example the values in the column age if it is between 0 to 18 then a NewAgeColumn2 is created where the age values between 0 to 18 is classified as 'Child', age from 19 to 50 classified as 'Adult', age between 51 to 70 classified as 'Wise' and rest classified as 'Old'. .. figure:: images/if_else.png :class: with-shadow :scale: 100 *image If Else function* | ============================== Ifnull ------------------------------- Ifnull ( ): The function returns a specified value if the expression is NULL. If the expression is NOT NULL, the function returns the expression Syntax: .. code-block:: python ifnull(#Column#, value) as <> New column name: NewIfNullColumn Example: .. code-block:: python ifnull(#Balance#, 0) as <> In case the value in a selected column equals to 0, the function will return a null value | ============================== Initcap ------------------------------- Initcap ( ): The function returns the expression with the first letter in each word in uppercase. Rest of the letters are in lowercase. Words are delimited by a white space Syntax: .. code-block:: python initcap(#Column#) as <> New column name: NewInitCapColumn Example: .. code-block:: python initcap(#Gender#) as <> In case the variable like 'male' and 'female' start with a lower case letter, the new column after applying the function will have values 'Male' and 'Female' | ============================== Int ------------------------------- Int ( ): The function returns the numeric integer equivalent from a given expression Syntax: .. code-block:: python int(#Column#) as <> New column name: NewIntColumn Example: .. code-block:: python int(#Column#) as <> In case we have a numerical column and one of the values is 7.75, the Int number after transformation will be 7 | ============================== .. _insert-trans: Insert ------------------------------- insert( ): Insert word at the position specified in the field. Syntax: .. code-block:: python insert(#column#, position,'_WordToInsert') as <> The first word is the field name within # where the word needs to be inserted. Example #occupation# The second is the number which is the position where the word will be inserted. Example the number 4 The third part is the text / word with double quotes that needs to be inserted. Example '_SERVICE' New column name: NewColumn .. note:: * A Applies to Categorical and Text data type and the new column which is created is Categorical type. * B If the field does not have any word in the field the insert function will not insert the word. Example: .. code-block:: python insert(#occupation#,4,'_SERVICE') as <> The word '_SERVICE' is inserted after the 4th position of the characters found in the field #occupation#. | ============================== Instr ------------------------------- Instr ( ): The function returns the position of the first occurrence of a string in another string Syntax: .. code-block:: python instr(#Column#, 'substr') as <> New column name: NewInstrColumn Example: .. code-block:: python instr(#Name#, 'Mr.') as <> We are searching for an expression/string 'Mr' in a name column and the function will tell us a position number where the string was found. In case the string is not there, the function returns '0'. | ============================== Inverse sign ------------------------------- Inverse_sign ( ): The function returns the negated values i.e. positive to negative and vice versa of given numerical column Syntax: .. code-block:: python inverse_sign(#Column#) as <> New column name: NewInvSignColumn Example: .. code-block:: python inverse_sign(#Temeprature#) as <> In a selected 'Temperature' column we will have values like 20, 7, -6 and -18 and after the transformation the function will return values -20, -7, 6 and 18. | ============================== Is_holiday ------------------------------- is_holiday( ): Check whether the given date column is holiday or not based on country. Syntax: .. code-block:: python is_holiday(#column#, 'countrycode') as <> New column name: Newcol_isholiday **For the list of country codes click on the link below:** |location_link11|. .. |location_link11| raw:: html List of country codes Example: .. code-block:: python is_holiday(#DOB#, 'US') as <> Above example shows the given date column is holiday or not based on country | ============================== Is_Text ------------------------------- is_text( ): Conditional statement to check if the each field from the given column is Text. Returns True or False. Condition: A string is considered as text, if its length of characters is greater than 50. Syntax: .. code-block:: python is_text(#column#) AS <> New column name: Newcol_istext .. note:: * Applies for both Numerical and Categorical datatypes to check whether the value is Text or Not Example: .. code-block:: python is_text(#age#) as <> Checks whether the string in the field is text, returns condition True or False | ============================== Is_weekend ------------------------------- is_weekend( ): Check whether the given date column is weekday or not based on country. Syntax: .. code-block:: python is_holiday(#column#, 'countrycode') as <> New column name: Newcol_isweekend **For the list of country codes click on the link below:** |location_link12|. .. |location_link12| raw:: html List of country codes | .. note:: * The is_weekend function is only applied for date_columns * Countrycode field should have valid codes Example: .. code-block:: python is_weekend(#DOB#, 'US') as <> Above example shows the given date column is weekend or not based on country | ============================== Isnan ------------------------------- Isnan ( ): The function determines whether a value is NaN (Not-a-Number). This function returns true if the value equates to NaN. Otherwise it returns false Syntax: .. code-block:: python isnan(#Column#) as <> New column name: NewIsNaNColumn Example: .. code-block:: python isnan(#PasswordColumn#) as <> We are checking if a value in a selected "PasswordColumn" is Not A Number - if we have value '567' the function will return 'false', as the expression is a number, if we have value 'blue' then function will return 'true' as the expression is not a number. | ============================== Isnull ------------------------------- Isnull ( ): The function returns 1 or 0 depending on whether an expression is NULL. If expression is NULL, this function returns 1 as true. If not, it returns 0 as false. Syntax: .. code-block:: python isnull(#Column#) as <> New column name: NewIsNullColumn Example: .. code-block:: python isnull(#Column#) as <> We are checking if a value in a selected "Column" is NULL. If the value in the column is '1' the function will return 'False' as the function is not NULL. | ============================== Is not null ------------------------------- Is not null ( ): The function returns 1 or 0 depending on whether an expression is NULL. If expression is not NULL, this function returns 1 as true. If it is NULL, it returns 0 as false. Syntax: .. code-block:: python isnotnull(#Column#) as <> New column name: NewIsNotNullColumn Example: .. code-block:: python isnotnull(#Column#) as <> We are checking if a value in a selected "Column" is NOT NULL. If the value in the column is '1' the function will return 'True' as the function is not NULL. | ============================== Is Numeric ------------------------------- is_numeric( ): Conditional statement to check if the value in the field is numeric, returns True or False Syntax: .. code-block:: python is_numeric(#Age#) AS <> New column name: NewColumn .. note:: Applies to Categorical data type and the new column which is created is Categorical type. Example: .. code-block:: python is_numeric(#Age#) AS <> Checks whether the text in the field is numerical, returns condition True or False. | ============================== Last_day ------------------------------- Last_day ( ): The function extracts the last day of the month for a given date Syntax: .. code-block:: python last_day(#Column#) as <> New column name: NewLDColumn Example: .. code-block:: python last_day(#DateColumn#) as <> If we have a Date Column with a value of '2021-01-01'the function will return the last day of the respective month so '2021-01-31'. | ============================== Least ------------------------------- Least ( ): The function returns the smallest value from the list of arguments Syntax: .. code-block:: python least(#Column1#, #Column2#, ..., #ColumnN#) as <> New column name: NewLeastColumn Example: .. code-block:: python least(#Balance#, #EstimatedSalary#) as <> If we have numerical columns with multiple values, the function will select the smallest value. If one column contains a value of 14,567 and second column a value of 6,553 the function will return 6,553 as the Least value. | ============================== Left ------------------------------- **For video tutorial see:** |location_link13|. .. |location_link13| raw:: html SEDGE YouTube channel - L functions left( ): Extract left characters from the fields till the numerical count. Syntax: .. code-block:: python left(#nationality#,3) as <> New column name: NewColumn Above example the function extracts the first 3 characters in the field from left. .. note:: Applies to Categorical, Text, Numerical and Date data type and the new column which is created is Categorical type. Example: .. code-block:: python left(#nationality#,3) as <> Example above, function extracts the first 3 characters in the field from left | ============================== .. _length-trans: Length ------------------------------- length( ): Gives the length of the number of characters in the field Syntax: .. code-block:: python length(#column1#) AS <> New column name: NewColumn .. note:: Applies to Categorical, Numerical, Text and date data type and the new column which is created is Numerical type. Example: .. code-block:: python length(#column1#) AS <> Generates the length of the characters in the field. | ============================== Levenshtein ------------------------------- Levenshtein ( ): The function returns the Levenshtein distance between two strings. The Levenshtein distance is the number of characters you have to replace, insert or delete to transform string1 into string2 Syntax: .. code-block:: python levenshtein(#Column1#, #Column2#) as <> New column name: NewLevenColumn Example: .. code-block:: python levenshtein(#Password1#, #Password2#) as <> If one value is "tomato" and second one is "mate" the Levenshtein distance is 3 - as we need to replace, insert or delete 3 characters to transform "tomato" into "mate". | ============================== Ln ------------------------------- Ln ( ): The function returns the natural logarithm of a number Syntax: .. code-block:: python ln(#Column#) as <> New column name: NewLnColumn Example: .. code-block:: python ln(#NumericalColumn#) as <> If the value in a column is '3' the Ln function will return 1.0986122886681098 as a result. | ============================== Locate ------------------------------- Locate ( ): The function identifies the position of the first occurrence of a substring in a string. If the substring is not found in the original string, the function returns 0. The function ignores upper or lower cases Syntax: .. code-block:: python locate('substr', #Column#) as <> New column name: NewLocColumn Example: .. code-block:: python locate('r', #Surname#) as <> We are searching for the expression "r" in the surname column. Surname "Nobe" returns 0 as it does not contain an "r". Surname "Taylor" returns "6" as the expression "r" is at the 6th position. | ============================== Log10 ------------------------------- Log10 ( ): The function returns the natural logarithm of a number to base 10 Syntax: .. code-block:: python log10(#Column#) as <> New column name: NewLog10Column Example: .. code-block:: python log10(#NumericColumn#) as <> We are searching for a natural logarithm of a number to base 10. In case the value is 4 the function will return 0.602059991328. | ============================== Log1p ------------------------------- Log1p ( ): The function returns the natural logarithm (base e) of 1 + a given number. E is the Eulers number which is approximately 2,718 Syntax: .. code-block:: python log1p(#Column#) as <> New column name: NewLog1pColumn Example: .. code-block:: python log1p(#NumericColumn#) as <> We are searching for a natural logarithm (base e) of 1 + a given number. In case the value is 4 the function will return 1.6094379124341003. | ============================== Log2 ------------------------------- Log2 ( ): The function returns the natural logarithm of a number to base 2 Syntax: .. code-block:: python log2(#Column#) as <> New column name: NewLog2Column Example: .. code-block:: python log2(#NumericColumn#) as <> We are searching for a natural logarithm with base 2 of a given number. In case the value is 4 the function will return 2. | ============================== .. _lower-trans: Lower ------------------------------- lower( ): converts characters to lower case. Syntax: .. code-block:: python lower(#column1#) AS <> New column name: NewColumn .. note:: Applies to categorical and text data type and the new column which is created is Categorical type. Example: .. code-block:: python lower(#column1#) AS <> Converts characters to lower case. | ============================== Lpad ------------------------------- Lpad ( ): The function left-pads a string with another string, to a certain specified length Syntax: .. code-block:: python lpad(#Column#, length, 'pad') as <> New column name: NewLpadColumn Example: .. code-block:: python lpad(#Hello#, 8, 'ABC') as <> The function will return ABCHello. | ============================== Ltrim ------------------------------- Ltrim ( ): The function removes leading spaces from a string Syntax: .. code-block:: python ltrim(#Column#) as <> New column name: NewLtrimColumn Example: .. code-block:: python ltrim(# hello#) as <> The function will return 'hello' without spaces on the left side. | ============================== Logarithm ------------------------------- log( ): gives the logarithmic value of the number. A natural log value is logarithm to the base of the mathematical constant e, where e is an irrational and transcendental number approximately equal to 2.71828 Similarly , user can use any base like 2,10 to calculate the logarithmic value. Syntax: .. code-block:: python log(Base value , #column1#) AS <> New column name: NewColumn .. note:: Applies to Numerical data type and the new column which is created is Numerical type. Example: .. code-block:: python log('e',#column1#) AS <> Example: .. code-block:: python log(2,#column1#) AS <> Performing logarithmic transformation help in data normality especially if the data is skewed. The log transformation can be used to make highly skewed distributions less skewed. Log transformation is important as it makes patterns in the data more interpretable. ========== | ============================== Max ------------------------------- max( ): This function returns the maximum value in the specified column. Syntax: .. code-block:: python max(#column#) as <> New column name: Newcol_max .. note:: The max function is only applied for numerical columns Example: .. code-block:: python min(#Age#) as <> Above example will return the maximum of age. | ============================= Md5 ------------------------------- Md5 ( ): The Message Digest Algorithm 5 (MD5) is a cryptographic hash algorithm that is used to create a 128-bit string value from an arbitrary length string Syntax: .. code-block:: python md5(#Column#) as <> New column name: NewMd5Column Example: .. code-block:: python md5(#France#) as <> The function will return 0309a6c666a7a803fdb9db95de71cf01 as a result. | ============================== Mean ------------------------------- mean( ): This function returns the average of values in the specified column. Syntax: .. code-block:: python mean(#column#) as <> New column name: Newcol_mean .. note:: The mean function is only applied for numerical columns. Example: .. code-block:: python mean(#Age#) as <> Above example will return the average value of age column. | ============================== Median ------------------------------- median( ): This function returns the median value of a range of values in the specified column. Syntax: .. code-block:: python median(#column#) as <> New column name: Newcol_median .. note:: The median function is only applied for numerical columns. Example: .. code-block:: python median(#Age#) as <> Above example will return the median value of age column. | ============================== Milisec_todate ------------------------------- Milisec_todate ( ): The function creates timestamp from the number of milliseconds since UTC epoch - Unix epoch is 00:00:00 UTC on 1 January 1970 Syntax: .. code-block:: python millisec_todate(#Column#) as <> New column name: NewMiliColumn Example: .. code-block:: python millisec_todate(#2021/01/01 15:26:40#) as <> The above mentioned date (1st January 2021, 15:26:40) will be converted into 1609511200000. | ============================== Min ------------------------------- min( ): This function returns the minimum value in the specified column. Syntax: .. code-block:: python min(#column#) as <> New column name: Newcol_min .. note:: The min function is only applied for numerical columns Example: .. code-block:: python min(#Age#) as <> Above example will return the minimum of age. | ============================== Minute ------------------------------- Minute ( ): The function returns minute part of a time/datetime (from 0 to 59) Syntax: .. code-block:: python minute(#Datecolumn#) as <> New column name: NewMinuteColumn Example: .. code-block:: python minute(#2021/01/01/15:26:40#) as <> The above mentioned date (1st January 2021, 15:26:40) will give results of '26' as the extracted minute. | ============================== Mod ------------------------------- Mod ( ): The function helps find a remainder after a number (dividend) is divided by another number (divisor) Syntax: .. code-block:: python mod(#Column1#, #Column2#) as <> New column name: NewModColumn Example: .. code-block:: python mod(#30#, #4#) as <> The above mentioned number will give us a result of '2' as when we divide 30 by 4 we will get 7 in full numbers and 2 is the remainder. | ============================== Month ------------------------------- Month ( ): The function extracts the month from a given date as number between 1 and 12 Syntax: .. code-block:: python month(#Datecolumn#) as <> New column name: NewMonthColumn Example: .. code-block:: python month(#2021/01/01/15:26:40#) as <> The above mentioned date (1st January 2021, 15:26:40) will give results of '1' as the extracted month. | ============================== Months_between ------------------------------- Months_between ( ): The function returns number of months between date1 and date2. If date1 is later than date2, then the result is positive. If date1 and date2 are on the same day of month, or both are the last day of month, time of the day will be ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8 digits. If we mention 'false' in the syntax, the number will not be rounded. Syntax: .. code-block:: python months_between(#EndDatecolumn#, #StartDatecolumn#, false) as <> New column name: NewMBColumn Example: .. code-block:: python months_between(#2003/01/01#, #2003/03/14#, false) as <> The above mentioned date difference will be calculated as -2.41935483870968. | ============================== Multiple Replacement ------------------------------- **For video tutorial see:** |location_link14|. .. |location_link14| raw:: html SEDGE YouTube channel - M functions mul_replacement( ): Multiple replacement replaces multiple characters from defined position, with new characters. Syntax: .. code-block:: python mul_replacement(#marital_status#,4,7,'_OK') As <> New column name: NewColumn .. note:: Applies to Categorical data type and the new column which is created is Categorical type. Example: .. code-block:: python mul_replacement(#marital_status#,4,7,'_OK') As <> Assuming in the example above, the data in the column #marital_status# is 'Married', it will replace 'ried' with '_OK', the output will be 'Marr_OK'. ========== | ============================== Negate_bool ------------------------------- Negate_bool(): The Negate_bool function returns the negation of the boolean value (True/False) in the column. Syntax: .. code-block:: python Negate_bool(#column#) as <> New column name: Newcol_negate .. note:: The Negate_bool function is only applied to the column contains boolean values. Example: .. code-block:: python Negate_bool(#Text#) as <> Above example will return the negation value of Text column. | ============================== Negative ------------------------------- Negative ( ): The function returns the negative absolute of values from a given numerical column Syntax: .. code-block:: python negative(#Column#) as <> New column name: NewNegColumn .. note:: Applies for Numerical column and the resulting column is numerical data type Example: .. code-block:: python negative(#data#) as <> The function will first perform the absolute value and after that it will convert the value into a negative one | ============================== Next_day ------------------------------- Next_day ( ): The function returns the first day after the selected day Syntax: .. code-block:: python next_day(#Datecolumn#, 'day_of_week') as <> New column name: NewNDColumn Example: .. code-block:: python next_day(#2020-08-10#, 'WED') as <> The function will return the next Wednesday after the mentioned date which is 2020-08-12. | ============================== Normalize ------------------------------- Data normalization is the process of scaling the data from the given numerical column down to an equivalent scale of 0 to 1. Normalize( ): Creates a new normalized column Syntax: .. code-block:: python normalize(#column#) AS <> New column name: New_normalized_col .. note:: The normalize() function is only applied for Numerical datatype columns Example: .. code-block:: python normalize(#age#) as <> Returns the normalized column for given numerical column. | ============================== Now ------------------------------- Now( ): The function returns the current date and time. Can be used to compare a different date column with a current date. Syntax: .. code-block:: python now() as <> New column name: NewNowColumn Example: .. code-block:: python now() as <> The function returns current date and time - 2020-30-05 15:20:43 | ============================== NPER ------------------------------- nper( ):The NPER function returns the number of periods for an investment based on periodic, constant payments and a constant interest rate. * Annual_Interest_Rate - The interest rate per period * Amount_of_Payments - The payment made each period. Generally, it contains principal and interest but no other fees and taxes. * Loan_amount - The present value, or the lump-sum amount that a series of future payments is worth right now. * Final_balance - The future value or the cash balance * Type - Enter "end" or omit for investing at the end of each period, "begin" for the beginning of each period. Syntax: .. code-block:: python nper(#Annual_Interest_Rate#, #Amount_of_Payments#, #Loan_amount#, #Final_balance#,'end') as <> New column name: Newcol_nper Example: .. code-block:: python nper(#int_rate#, #open_acc#, #pv#) as <> The above example returns the expected value. | ========================= Nullif ------------------------------- Nullif ( ): The function returns null if two expressions are equal, otherwise it returns the first expression Syntax: .. code-block:: python nullif(#Column#, 'value_toNull') as <> New column name: NewNullifColumn Example: .. code-block:: python nullif(#Age#, '55') as <> With every occurrence of 55 in the "Age" column the value will be replaced by a NULL | ============================== Nvl2 ------------------------------- Nvl2 ( ): The function allows you to substitute a value when a null value is encountered and also when a non-null value is encountered Syntax: .. code-block:: python nvl2(#Column#, 'value_ifnotNull', 'value_ifNull') as <> New column name: NewNvl2Column | ============================== Octet_length ------------------------------- Octet_length ( ): The function returns an integer indicating the number of bytes in the input string. Syntax: .. code-block:: python octet_length(#Column#) as <> New column name: NewOctColumn Example: .. code-block:: python octet_length(#Geography#) as <> A variable 'France' will return a number 6 as the length. Germany will be 7 etc. | ============================== Offday_count ------------------------------- offday_count( ): Returns the number of holidays between two given date columns(Include holidays and weekends) based on the country Syntax: .. code-block:: python offday_count(#column1#,#column2#, 'countrycode') as <> New column name: Newcol_offdaycount **For the list of country codes click on the link below:** |location_link15|. .. |location_link15| raw:: html List of country codes .. note:: * The offday_count function is only applied for date_columns. * Countrycode field should have valid codes Example: .. code-block:: python offday_count (#DOB#, #Date of Hire#, 'US') as <> Above example returns the number of holidays between two given date columns(Include holidays and weekends) based on the country | ============================== One Replacement ------------------------------- **For video tutorial see:** |location_link16|. .. |location_link16| raw:: html SEDGE YouTube channel - O functions One_replacement( ): One replacement replaces single character from defined position, with new characters. Syntax: .. code-block:: python one_replacement(#Occupation#,4,'_Yes') as <> New column name: NewOccupationColumn .. note:: Applies to Categorical and text data type and the new column which is created is Categorical type. Example: .. code-block:: python one_replacement(#Occupation#,4,'_Yes') as <> Assuming in the example above, the data in the column #Occupation# is 'Prof', it will replace 'f' with '_YES', the output will be 'Pro_Yes'. | ============================== Pmod ------------------------------- Pmod ( ): The Pmod function returns the positive modulus of a number Syntax: .. code-block:: python pmod(#Column1#, #Column2#) as <> New column name: NewPmodColumn | ============================== PMT ------------------------------- pmt( ): The PMT Function returns the regular and constant repayments for a loan or mortgage required to reduce the balance to zero, or an amount you specify. It’s based on a constant interest rate.  Syntax: .. code-block:: python pmt(#Annual_Interest_Rate#, #Number_of_Payments#, #Loan_amount#, #Final_balance#, 'end') as <> New column name: : Newcol_pmt Example: .. code-block:: python pmt(#int_rate#, #open_acc#, #pv#) as <> The above example returns the expected value. | ============================== Positive ------------------------------- Positive ( ): Returns the positive(or absolute) of values from the given numerical column Syntax: .. code-block:: python positive(#Column#) as <> New column name: NewPosColumn .. note:: Applies for Numerical column and the resulting column is numerical data type Example: .. code-block:: python positive(#data#) as <> Returns the positive(absolute) value of given value | ============================== Power ------------------------------- **For video tutorial see:** |location_link17|. .. |location_link17| raw:: html SEDGE YouTube channel - P functions pow( ): The power (or exponent) of a number is the number of times to use the number in a multiplication Syntax: .. code-block:: python pow(#age#,3) as <> New column name: NewColumn .. note:: * A Applies to Numerical data type and the new column which is created is Numerical type. * B The power transformation is used when the numbers are clustered too close to each other. Power Transformer is used to make the data distribution more-Gaussian (Normal). Example: .. code-block:: python pow(#age#,3) as <> The power (or exponent) of a number is the number of times to use the number in a multiplication. In the example above the power is raised to 3, that is multiplying the number in the field three times. | ==================== Proper ------------------------------- Proper( ): The Proper Case converts categorical and text values to upper case for the first character of each word and to lower case for subsequent characters in the word. Syntax: .. code-block:: python Proper(#column1#) AS <> New column name: NewProperColumn .. note:: * A Applies to categorical and text data type and the new column which is created is Categorical type. * B Does not apply to Numerical and Data type columns Example: .. code-block:: python Proper(#column1#) AS <> Converts characters to proper case (Starting Alphabet is capitals, rest are lower case). | ============================== Quarter ------------------------------- Quarter( ): The function returns the quarter of the year for a given date value (a number from 1 to 4) * January-March will return 1 * April-June will return 2 * July-Sep will return 3 * Oct-Dec will return 4 Syntax: .. code-block:: python quarter(#Datecolumn#) as <> New column name: NewQuartColumn Example: .. code-block:: python quarter(#2021-02-02#) as <> The function will return 1 as the number of quarter | ============================== Radians ------------------------------- **For video tutorial see:** |location_link18|. .. |location_link18| raw:: html SEDGE YouTube channel - R functions radians( ): Converts angles in degrees to radians. 1 degree is equivalent to (π/180) radians. Multiply the number of degrees with by π/180 to convert it to radian terms. Syntax: .. code-block:: python radians(#angles#) AS <> New column name: NewRadiansColumn .. note:: Applies to Numerical data type and the new column which is created is Numerical type. Example: .. code-block:: python radians(#angles#) AS <> Converts angles in degrees to radians. .. figure:: images/angles2radians.png :class: with-shadow :scale: 100 *image Angles to Radians* | ============================== Rand ------------------------------- Rand ( ): The function returns a random number between 0 (inclusive) and 1 (exclusive). If seed is specified, it returns a repeatable sequence of random numbers. If seed is not specified, it returns a completely random number Syntax: .. code-block:: python rand(Seed) as <> New column name: NewRandColumn | ============================== Randn ------------------------------- Randn ( ): The function returns a random value with independent and identically distributed (i.i.d.) values drawn from the standard normal distribution Syntax: .. code-block:: python randn(Seed) as <> New column name: NewRandnColumn | ============================== Reciprocal ------------------------------- reci( ): The reciprocal transformation is defined as the transformation of x to 1/x. The transformation reverses the order of values with the same sign. Syntax: .. code-block:: python reci(#column1#) AS <> New column name: NewReciprocalColumn .. note:: * A Applies to Numerical data type and the new column which is created is Numerical type. * B The transformation can only be used for non-zero values and also values should not be missing. * C If the standard deviation is proportional to the mean squared, a reciprocal transformation can be performed. * D Transformation helps to improve interpretability. If fuel economy of car is being compared, then there are two way of looking at the fuel consumption a. liters per kilometer or b. kilometers per liter. The reciprocal of 'liters per Kilometer' is 'kilometers per liter'. Example: .. code-block:: python reci(#column1#) AS <> The reciprocal transformation is defined as the transformation of x to 1/x. .. figure:: images/reciprocal.png :class: with-shadow :scale: 100 *image MilesPerGallon to GallonsPerMile using reciprocal* | ============================== Regexp_replace ------------------------------- Regexp_replace ( ): The function replaces the portions of a string that match a given regular expression with the contents of another string * #Column# - The string perform a search-and-replace operation upon * Reg_exp - the JavaScript style regular expression to evaluate * Replace_str - the string with which the matched substrings are to be replaced Syntax: .. code-block:: python regexp_replace(#Column#, 'regexp_replace', 'replace_str') as <> New column name: NewRegRepColumn | ============================== Regexp_extract ------------------------------- Regexp_extract ( ): This string function is used in search operations for sophisticated pattern matching including repetition and alternation * #Column# - The string to search for strings matching the regular expression * Reg_exp - the JavaScript style regular expression to evaluate * Group_idx - an optional regular expression group number, defining which portion of the matching string will be returned Syntax: .. code-block:: python regexp_extract(#Column#, 'reg_exp', group_idx) as <> New column name: NewRegExtColumn | ============================== Remove ------------------------------- remove( ): The remove function removes a certain string pattern from the text. Syntax: .. code-block:: python remove(#column1#, 'Pattern to be removed') AS <> New column name: NewColumn .. note:: * A Applies to string / text column * B The transformation can only be used for non-zero values and also values should not be missing. Example: .. code-block:: python remove(#column1#, 'USD') AS <> In the above example all instances of string USD, will be removed from the text in the #column1# | ============================== .. _repeat-trans: Repeat ------------------------------- repeat( ): Repeats the value in the field the number of times as defined. Syntax: .. code-block:: python repeat(#MAKE#,2) as <> New column name: Make_repeat_NewCol .. note:: Applies to Numerical, categorical, text data type and the new column which is created is Categorical type. Example: .. code-block:: python repeat(#MAKE#,2) as <> This will repeat the values in the field 'Make' twice. | ============================== .. _replace-trans: Replace ------------------------------- replace( ): The replace function replaces a string pattern with another string pattern within a text / string column. Syntax: .. code-block:: python replace(#column1#, 'Existing string pattern','string to be replaced') AS <> New column name: NewColumn .. note:: * A Applies to string / text column * B The transformation can only be used for non-zero values and also values should not be missing. Example: .. code-block:: python replace(#column1#, 'EUR', 'USD') AS <> In the above example all instances of string EUR, is replaced with string USD, from the text in the #column1#. | ============================== Reverse ------------------------------- reverse( ): Reverses the order of the data. If the value is ABCDE, a reverse function will reverse the value to EDCBA Syntax: .. code-block:: python reverse(#position#) as <> New column name: NewColumn .. note:: Applies to Categorical, Text and Numerical data type and the new column which is created is Numerical type. Example: .. code-block:: python reverse(#position#) as <> Reverses the order of the character .. figure:: images/Reversetrans.png :class: with-shadow :scale: 100 *image Text Reversed* | ============================== Right ------------------------------- right( ): Extract right characters from the fields till the numerical count. Syntax: .. code-block:: python right(#nationality#,3) as <> New column name: NewColumn Above example the function extracts the first 3 characters in the field from right. .. note:: Applies to Categorical, Text, Numerical and Date data type and the new column which is created is Categorical type. Example: .. code-block:: python right(#nationality#,3) as <> Example above, function extracts the first 3 characters in the field from right. | ============================== Rint ------------------------------- rint( ): The rint functions returns a floating-point value that represents the nearest integer to x. Halfway values are rounded according to the current setting of the floating-point rounding mode Syntax: .. code-block:: python rint(#Column#) as <> New column name: NewRintColumn | ============================== Round ------------------------------- round( ): Rounds the value. Syntax: .. code-block:: python round(#MPG#,1) as <> New column name: NewcolRound .. note:: Applies to Numerical data type and the new column which is created is Numerical type. Example: .. code-block:: python round(#MPG#,1) as <> Rounds the value. .. figure:: images/rounding.png :class: with-shadow :scale: 100 *image rounding of Miles Per Gallon (MPG)* | ============================== Row_index ------------------------------- row_index(): The function assigns a unique, sequential index number to each row, starting with 1 Syntax: .. code-block:: python row_index() as <> New column name: NewRowInColumn Example: .. code-block:: python row_index() as <> Returns the index of each row | ============================== Rtrim ------------------------------- Rtrim( ): The function removes trailing spaces from a string Syntax: .. code-block:: python rtrim(#Column#) as <> New column name: NewRtrimColumn Example: .. code-block:: python rtrim(# hello #) as <> The function will return 'hello' without spaces on the sides. | ============================== Rpad ------------------------------- Rpad( ): The function right-pads a string with another string, to a certain length (when string1 is not null) Syntax: .. code-block:: python rpad(#Column#, length, 'pad') as <> New column name: NewRpadColumn Example: .. code-block:: python rpad(#Hello#, 8, 'ABC') as <> The function will return HelloABC. | ============================== Running total ------------------------------- running_total( ): This function returns a sequence of partial sums of given column which means displaying the summation of the data as it grows over time. Syntax: .. code-block:: python running_total(#column#) as <> New column name: Newcol_total .. note:: The running_total function is only applied for numerical columns. Example: .. code-block:: python running_total(#Age#) as <> Above example will return the running_total of age column. | ============================== Second ------------------------------- second( ): The function returns the seconds part of a time/datetime (from 0 to 59). Syntax: .. code-block:: python second(#Datecolumn#) as <> New column name: NewSecColumn Example: .. code-block:: python second(#2021-09-16 11:56:33#) as <> The function will return 33. | ============================== Sha1 ------------------------------- Sha1( ): The function calculates an SHA-1 160-bit checksum for a string. The function returns a value as a binary string of 40 hex digits. In case the string supplied as the argument is NULL, the function returns NULL. Syntax: .. code-block:: python sha1(#Column#) as <> New column name: NewSha1Column Example: .. code-block:: python sha1(#Sedge#) as <> The function will return fca21fee4d117ec70a95dbee415c01e5c6e86df3 | ============================== Sha2 ------------------------------- Sha2( ): The function uses the SHA2 cryptographic hash function to convert a variable-length string into a character string. The character string is a text representation of the hexadecimal value of the checksum with the specified number of bits Syntax: .. code-block:: python sha2(#Column#, bitLength) as <> New column name: NewSha2Column .. note:: * String - A variable-length string * Integer - The number of bits in the hash functions. Valid values are 0 (same as 256), 224, 256, 384, and 512 | ============================== Shiftleft ------------------------------- Shiftleft( ): The function performs a left shift bitwise Syntax: .. code-block:: python shiftleft(base, #Column#) as <> New column name: NewShLeftColumn | ============================== Shiftright ------------------------------- Shiftright( ): The function performs a right shift bitwise (signed) Syntax: .. code-block:: python shiftright(base, #Column#) as <> New column name: NewShRightColumn | ============================== Shiftrightunsigned ------------------------------- Shiftrightunsigned( ): The function performs a right shift bitwise (unsigned) Syntax: .. code-block:: python shiftrightunsigned(base, #Column#) as <> New column name: NewShRiUnColumn | ============================== Sign ------------------------------- **For video tutorial see:** |location_link19|. .. |location_link19| raw:: html SEDGE YouTube channel - S functions Sign( ): Converts if value in column is greater than 0 it returns 1 and if value is less than 0 it returns -1 and if the value is 0 it returns 0 Syntax: .. code-block:: python Sign(#MPG#) as <> New column name: Newcolsign .. note:: * A Applies to Numerical data type and the new column which is created is Numerical type. Example: .. code-block:: python Sign(#temperature#) as <> If the value in the column temperature is 36, then the new column (Newcol) will have value 1, If the value in the column temperature is -25, then the new column (Newcol) will have value -1 and If the value in the column temperature is 0, then the new column (Newcol) will have value 0 | ============================== Sin ------------------------------- Sin( ): The function returns the sine of a number Syntax: .. code-block:: python sin(#Column#) as <> New column name: NewSinColumn Example: .. code-block:: python sin(#1#) as <> The function will return the result of 0.8414709 | | ============================== Sinh ------------------------------- Sinh( ): The function returns the hyperbolic sine of a number Syntax: .. code-block:: python sinh(#Column#) as <> New column name: NewSinHColumn Example: .. code-block:: python sinh(#1#) as <> The function will return the result of 1.1752011 | ============================== Soundex ------------------------------- Soundex( ): The function evaluates expression and returns the most significant letter in the input string followed by a phonetic code Syntax: .. code-block:: python soundex(#Column#) as <> New column name: NewSoundColumn .. note:: * Characters that are not alphabetic are ignored * If expression evaluates to the null value, null is returned | ============================== Square Root ------------------------------- sqrt( ): Finds the square root of the number. Syntax: .. code-block:: python sqrt(#MPG#) as <> New column name: Newcolsqrt .. note:: * A Applies to Numerical data type and the new column which is created is Numerical type. * B The function can be used for square root transformation of positive values. * C Do not apply square root of negative values as the system will give error. Example: .. code-block:: python sqrt(#MPG#) as <> Finds the square root of the positive value. .. figure:: images/sqrt.png :class: with-shadow :scale: 100 *image Square Root of Miles Per Gallon (MPG)* | ============================== Step Delete ------------------------------- step_delete( ): Delete single character at certain position and then skips the next characters, and then deleting the Single character. Syntax: .. code-block:: python step_delete(#Occupation#,1,10,2) as <> New column name: NewColSkipDelete .. note:: * A Applies to Categorical and text data type and the resulting column is categorical data type. * B Does not apply to numerical data type. Example: .. code-block:: python step_delete(#Occupation#,1,10,2) as <> In the above example data in the field Occupation is selected. The step delete is applicable from 1st Character to 10th character and the skips 2 characters. Example if the word is 'Life is beautiful' then this function will work as per the below image. .. figure:: images/Step_Delete.png :class: with-shadow :scale: 80 | *image Step Delete function working explanation* | ============================== Standardize ------------------------------- standardize( ): Data standardization is the process of rescaling the data from the given numerical column to have a mean of 0 and a standard deviation of 1 (unit variance). The function creates a new standardized column Syntax: .. code-block:: python standardize(#Column#) as <> New column name: NewStandColumn .. note:: * The standardize fuction is only applied for Numerical datatype columns Example: .. code-block:: python standardize(#age#) as <> Returns the standardized column for given numerical column | ============================== Stddev ------------------------------- stddev( ): This function returns the standard deviation of values in the specified column. Syntax: .. code-block:: python stddev(#column#) as <> New column name: Newcol_stddev .. note:: The stddev function is only applied for numerical columns. Example: .. code-block:: python stddev(#Age#) as <> Above example will return the standard deviation value of age column. | ============================== String ------------------------------- String( ): The function converts the value expression to the string data type Syntax: .. code-block:: python string(#Column#) as <> New column name: NewStringColumn | ============================== .. _substring-trans: Substring ------------------------------- Substring( ): The function extracts a substring from a string (starting at any position) Syntax: .. code-block:: python substring(#Column#, position, length) as <> New column name: NewSubColumn | ============================== Substring_index --------------------- Substring:index( ): The function returns a substring of a string before a specified number of delimiter occurs Syntax: .. code-block:: python substring_index(#Column#, 'delimiter', count) as <> New column name: NewSubInColumn | ============================== Sum --------------------- sum( ): This function returns the sum of all values in the specified column. Syntax: .. code-block:: python sum(#column#) as <> New column name: Newcol_sum .. note:: The sum function is only applied for numerical columns. Example: .. code-block:: python sum(#Age#) as <> Above example will return the sum of all values in age column. | ============================== Tan --------------------- Tan( ): The function returns the tangent of a number Syntax: .. code-block:: python tan(#Column#) as <> New column name: NewTanColumn | ============================== Tanh --------------------- Tanh( ): The function returns the hyperbolic tangent of a number Syntax: .. code-block:: python tanh(#Column#) as <> New column name: NewTanHColumn Example: .. code-block:: python tanh(#1#) as <> The function will return the result of 0.7615941 | ============================== To_date --------------------- To_date( ): The function converts a string value to DATE data type value using the specified format Syntax: .. code-block:: python to_date(#Column#, 'Date_format_pattern') as <> New column name: NewToDateColumn **For Date Pattern Formats check:** |location_link20|. .. |location_link20| raw:: html Date Format Examples | ============================== To_timestamp --------------------- To_timestamp( ): The function converts a character datatype to a value of TIMESTAMP datatype format. The Timestamp_format_pattern specifies the format of character Syntax: .. code-block:: python to_timestamp(#Column#, 'Timestamp_format_pattern') as <> New column name: NewToTimeColumn **For Date Pattern Formats check:** |location_link21|. .. |location_link21| raw:: html Date Format Examples | ============================== To_unix_timestamp --------------------- To_unix_timestamp( ): The function returns the UNIX timestamp of the given time column. The unix time stamp is a way to track time as a running total of seconds. This count starts at the Unix Epoch on January 1st, 1970 at UTC Syntax: .. code-block:: python to_unix_timestamp(#Column#, 'Date/time_format_pattern') as <> New column name: NewUnixColumn **For Date Pattern Formats check:** |location_link22|. .. |location_link22| raw:: html Date Format Examples | ============================== To_utc_timestamp --------------------- To_utc_timestamp( ): The function returns the UTC timestamp of the given time column. Syntax: .. code-block:: python to_utc_timestamp(#Datecolumn#, 'timezone') as <> **For the list of time zones click on the link below:** |location_link23|. .. |location_link23| raw:: html List of time zones New column name: NewUTCColumn | ============================== .. _trim-trans: Trim --------------------- **For video tutorial see:** |location_link24|. .. |location_link24| raw:: html SEDGE YouTube channel - T functions trim( ): Trim function returns a text value with the starting and ending spaces removed. It also removes unnecessary spaces between words in a string. Syntax: .. code-block:: python trim(#occupation#) as <> New column name: Newcoltrim .. note:: * A Applies to Categorical, Text data type and the new column which is created is Categorical type. * B Do not apply to Numerical or date type. Example: .. code-block:: python trim(#occupation#) as <> Trim returns the categorical value where the unnecessary space between words, ending and starting of word is removed. | =============================== Typeof --------------------- Typeof( ): The function allows the user to quickly check a variable's data type — or whether it is “undefined” or “null” Syntax: .. code-block:: python typeof(#Column#) as <> New column name: NewTypeOfColumn | ============================== Unbase64 --------------------- unbase64( ): The function decodes a BASE64 encoded string column and returns it as a binary column Syntax: .. code-block:: python unbase64(#Column#) as <> New column name: NewUnbaseColumn ============================== Unhex --------------------- unhex( ): The function converts hexa decimal value (base 16) to decimal numbers (base 10). As a base-16 numeral system, it uses 16 symbols. Syntax: .. code-block:: python unhex(#Column#) as <> New column name: NewUnhexColumn | ============================== Unix_timestamp --------------------- unix_timestamp( ): The function returns the UNIX timestamp of current time. The unix time stamp is a way to track time as a running total of seconds. This count starts at the Unix Epoch on January 1st, 1970 at UTC Syntax: .. code-block:: python unix_timestamp() as <> New column name: NewUnixColumn Example: .. code-block:: python unix_timestamp() as <> Above example performed on Mon Jun 07 2021 14:16:31 GMT will return Unix Timestamp 1623075391 | ============================== Uuid --------------------- uuid( ): The function is used to return a Universal Unique Identifier (UUID) - this ID is random and unique for each row Syntax: .. code-block:: python uuid() as <> New column name: NewUuidColumn Example: .. code-block:: python uuid() as <> Above example will return a 36 character random string like for example 6521665d-ce8b-442a-b7c2-cfb4918f25fb | ============================== Unit converter --------------------- **For video tutorial see:** |location_link25|. .. |location_link25| raw:: html SEDGE YouTube channel - U functions unit_converter( ) : unit_converter of a number converts from one unit to another unit. The available units for conversion are listed below. Syntax: .. code-block:: python unit_converter(#temperature#, 'from_unit', 'to_unit') AS <> New column name: Newconvertedcolumn .. note:: The from_unit and the to_unit refer to the unit such as “C”. referring to Celsius or “F”, referring to Fahrenheit. The UNIT_CONVERTER value can only be applied to Numerical datatype columns. The new column name should not contain any special characters such as !@#$%^&*()-+= space If there are any missing rows in the numerical column, then the row will be ignored. The user can enter “from unit” and 'to unit' in upper case or lower case. Temperature units --------------------- Converts temperature units from Celsius to Fahrenheit and from Fahrenheit to celsius. Syntax: .. code-block:: python unit_converter(#temperature#, 'C', 'F') AS <> +--------------+------------+---------------+ |Temperature | From |Temperature. | +==============+============+===============+ | Celsius (C). | to | Fahrenheit(F) | +--------------+------------+---------------+ |Fahrenheit(F) | to |Celsius (C) | +--------------+------------+---------------+ Quantity units --------------------- Converts quantity units from Tablespoon to Teaspoon and from Teaspoon to Tablespoon. Syntax: .. code-block:: python unit_converter(#QuantityColumn#, 'TSP', 'TBS') AS <> +----------------+------+-----------------+ |Measure | From |Measure | +================+======+=================+ |Teaspoon (TSP) | to | Tablespoon (TBS)| +----------------+------+-----------------+ |Tablespoon (TBS | to |Teaspoon (TSP) | +----------------+------+-----------------+ Volume units --------------------- Converts volume units from gallons to liters and from liters to gallons. Syntax: .. code-block:: python unit_converter(#QuantityColumn#, 'GAL', 'L') AS <> Gallons (GAL) to Liters (L) Liters (L) to Gallons (GAL) Distance units --------------------- Converts distance units from miles to kilometers and from kilometers to miles. Syntax: .. code-block:: python unit_converter(#QuantityColumn#, 'MI', 'KM') AS <> Miles (MI) to Kilometers (KM) Kilometers (KM) to Miles (MI) Length units --------------------- Converts length units from inches to feet or centimeters and from feet to inches and centimeters and from centimeters to inches and feet. Syntax: .. code-block:: python unit_converter(#QuantityColumn#, 'IN', 'FEET') AS <> +----------------+------------+------------------+ |From Unit | |To Unit | +================+============+==================+ | Inches (IN) | to |Feet (FEET) | +----------------+------------+------------------+ | Feet (FEET) | to |Inches (IN | +----------------+------------+------------------+ | Inches (IN) | to | Centimeters (CM) | +----------------+------------+------------------+ |Feet (FEET) | to |Centimeters(CM) | +----------------+------------+------------------+ |Centimeters(CM) | to |Inches (IN) | +----------------+------------+------------------+ | Centimeters (CM)| to |Feet (FEET) | +----------------+------------+------------------+ | Inches (IN) | to |Feet (FEET) | +----------------+------------+------------------+ |Fahrenheit(F) | to |Celsius (C) | +----------------+------------+------------------+ Weight Units Converts weight units from kilograms to pounds and from pounds to kilograms. Syntax: .. code-block:: python unit_converter(#QuantityColumn#, 'KG', 'LBM') AS <> +----------------+------------+----------------+ |From Unit | |To Unit | +================+============+================+ | Kilograms (KG) | to |Pounds (LBM) | +----------------+------------+----------------+ | Kilograms (KG) | to |Grams (G) | +----------------+------------+----------------+ | Pounds (LBM) | to |Kilograms (KG). | +----------------+------------+----------------+ | Pounds (LBM) | to |Grams (G) | +----------------+------------+----------------+ |grams (G) | to |Kilograms (KG) | +----------------+------------+----------------+ |grams (G). | to |Pounds (LBM) | +----------------+------------+----------------+ ============================= .. _upper-trans: Upper --------------------- upper( ): converts characters to upper case. Syntax: .. code-block:: python upper(#column1#) AS <> New column name: NewColumn .. note:: Applies to categorical and text data type and the new column which is created is Categorical type. Example: .. code-block:: python upper(#column1#) AS <> Converts characters to upper case. ============================ Variance --------------------- variance( ): This function returns the variance of values in the specified column. Syntax: .. code-block:: python variance(#column#) as <> New column name: Newcol_variance .. note:: The variance function is only applied for numerical columns. Example: .. code-block:: python variance(#Age#) as <> Above example will return the variance of age column. | ============================= Weekday --------------------- weekday( ): The function returns the weekday number for a given date column Syntax: .. code-block:: python weekday(#Datecolumn#) as <> New column name: NewWeekdayColumn .. note:: 0 = Monday, 1 = Tuesday, 2 = Wednesday, 3 = Thursday, 4 = Friday, 5 = Saturday, 6 = Sunday Example: .. code-block:: python weekday(#2021-06-06#) as <> The function will return '6'for Sunday. | ============================== Weekend_count --------------------- weekend_count( ): Returns the number of weekend between two given date columns based on country Syntax: .. code-block:: python weekend_count(#column1#,#column2#, 'countrycode') as <> New column name: Newcol_weekendcount **For the list of country codes click on the link below:** |location_link26|. .. |location_link26| raw:: html List of country codes .. note:: * The weekend_count function is only applied for date_columns. * Countrycode field should have valid codes Example: .. code-block:: python weekend_count (#DOB#, #Date of Hire#, 'US') as <> Above example returns the number of weekend between two given date columns based on country | ============================== Weekofyear --------------------- Weekofyear( ): The function returns the week number for a given date (a number from 1 to 53) Syntax: .. code-block:: python weekofyear(#Datecolumn#) as <> New column name: NewWoYColumn .. note:: This function assumes that the first day of the week is Monday and the first week of the year has more than 3 days Example: .. code-block:: python weekofyear(#2020-01-23#) as <> Above example returns the number 4 as for the 4th week in the respective year | ============================== Workday_count --------------------- workday_count( ): Count the number of workdays between two given date columns based on country Syntax: .. code-block:: python workday_count(#column1#,#column2#, 'countrycode') as <> New column name: Newcol_workdaycount **For the list of country codes click on the link below:** |location_link27|. .. |location_link27| raw:: html List of country codes .. note:: * The workday_count function is only applied for date_columns. * Countrycode field should have valid codes Example: .. code-block:: python workday_count (#DOB#, #Date of Hire#, 'US) as <> Above example returns the number of workdays between two given date columns based on country | ============================== XXhash64 --------------------- XXhash64( ): The function returns a 64-bit hash value of the arguments/selected columns Syntax: xxhash64(#Column1#, #Column2#, ..., #ColumnN#) as <> New column name: NewXXColumn .. code-block:: python xxhash64(#30#, #7.8958#) as <> Above example will return -5523311690417587000 as a 64bit hash value. | ============================== Year --------------------- Year( ): The function returns the year part for a given date (a number from 1000 to 9999) Syntax: .. code-block:: python year(#Datecolumn#) as <> New column name: NewYearColumn Example: .. code-block:: python year(#2020-01-23#) as <> Above example returns the number 2020 as an extracted year | ============================== Year_to_date --------------------- Year( ): The function converts 4-digit year(yyyy) to default Date format (yyyy-MM-dd) Syntax: .. code-block:: python year_to_date(#Column#) as <> New column name: NewYTDColumn Visual Analytics ====================== Cluster Analysis ======================== Cluster analysis is a technique to group similar observations into a number of clusters based on the observed values of several variables for each individual. It is an exploratory data analysis technique that allows us to analyze the multivariate data sets. Clustering algorithms ---------------------- K-means ---------------- It is an iterative algorithm that divides the unlabeled dataset into k different clusters. .. figure:: images/clusteranalysis1.png :class: with-shadow :scale: 80 *image_K-means* Bisecting K-Means ------------------------- Bisecting K-Means is a hybrid approach of the K-Means algorithm to produce partitional/hierarchical clustering. It splits one cluster into two sub-clusters at each bisecting step (using k-means) until k clusters are obtained. .. figure:: images/clusteranalysis2.png :class: with-shadow :scale: 80 *image_Bisecting_K-Means* Gaussian Mixture ---------------------- The Gaussian mixture model is defined as a clustering algorithm that is used to discover the underlying groups of data. Gaussian mixture models have a higher chance of finding the right number of clusters in the data compared to k-means. .. figure:: images/clusteranalysis3.png :class: with-shadow :scale: 80 *image_Gaussian_Mixture* Method for Optimal Cluster ------------------------------- Elbow method ------------- Elbow Method is an empirical method to find the optimal number of clusters for a dataset. In this method, we pick a range of candidate values of k, then apply K-Means clustering using each of the values of k. Find the average distance of each point in a cluster to its centroid, and represent it in a plot. Pick the value of k, where the average distance falls suddenly. .. figure:: images/optimalcluster1.png :class: with-shadow :scale: 80 *image Elbow method* Silhouette method -------------------------- The silhouette Method is also a method to find the optimal number of clusters and interpretation and validation of consistency within clusters of data. The silhouette method computes silhouette coefficients of each point that measure how much a point is similar to its own cluster compared to other clusters. by providing a succinct graphical representation of how well each object has been classified. .. figure:: images/optimalcluster2.png :class: with-shadow :scale: 80 *image Silhouette method* Graphical Analysis ===================== .. figure:: images/GraphicalAnalysis.png :class: with-shadow :scale: 80 *image GraphicalAnalysis* Graphs are used to display large amounts of numerical data and to represent the relationships between numerical values of different variables. They can also be used to derive quantitative relationships between variables. .. figure:: images/BarChartAge.png :class: with-shadow :scale: 80 *image BarChartAge* Bart chart presents data with rectangular bars with heights or lengths proportional to the values that they represent. .. figure:: images/BarChartAge.png :class: with-shadow :scale: 80 *image BarChartAge* We have chosen age as an example for our bar chart. We can point our mouse at any part of the bar and the system will tell as which point we are currently standing - X axis tells us the location of our mouse and Y axis tells us the total count in the bar. Also we can zoom in or out by the grey tool in the upper part of the graph (highlighted in green). “Group by” tool and “Aggregation” tool works with categorical values. .. note:: We can also select the “Use Distinct” tool - which represents percentage of each category in the pie irrespective of the count in each category. Stack Bar --------------------- The stack bar chart extends the standard bar chart from looking at numeric values across one categorical variable to two, where each bar is divided into a number of sub-bars. .. figure:: images/StackBarSurEmb.png :class: with-shadow :scale: 80 *image StackBarSurEmb* In this Stack Bar chart we have selected two factors - “port of embarkment” and “survived”. We can see that each port of embarkment - 0,1 and 2 is dived into two colours - 1 showing those who survived and 0 showing the opposite. By pointing the mouse on any section of the bar we can see the total number of cases in the group. Box Plot --------------------- Box Plot graph depicts groups of numerical data through their quartiles. Box plots have also lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles, these are the outlier values. .. figure:: images/BoxPlotAge.png :class: with-shadow :scale: 80 :width: 98% *image BoxPlot* We used age as an example, we can see how mostly the age of passengers was in between 22 and 35 years, other values re represented as outliers. By pointing at the graph we can also see the *Open, Low, High* and *Close* values. Pie Chart --------------------- Pie charts are beneficial in expressing data and information in terms of proportions. A pie chart is most effective when dealing with the collection of data. It is constructed by clustering all the required data into a circular constructed frame wherein the data are depicted in slices. | .. figure:: images/PieChartMealNo.png :class: with-shadow :scale: 80 *image PieChartMealNo* For our example we used the number of meals and we can see the number was between 1 and 9, each part of the pie is also represented by the percentage and after pointing at each section we can also see the total number of cases Bubble Chart --------------------- A bubble chart is a type of chart that displays three dimensions of data. Each entity with its triplet of associated data is plotted as a disk that expresses two of the vᵢ values through the disk's xy location and the third through its size. .. figure:: images/BubbleChartAgeFare.png :class: with-shadow :scale: 80 *image BubbleChartAgeFare* We used Age and Fare for our example and we can now see how the values are spread in the graph. _______________________________ We can also add another factor in our chart, which represents the “Size” - we selected “Fare” and after that we can see the bubbles size is corresponding with the increasing fare. .. figure:: images/BubbleChartAgeFareSize.png :class: with-shadow :scale: 80 *image BubbleChartAgeFareSize* _______________________________ We have another feature that can help us represent the values effectively - we can add another factor and represent it in colour, in our case we have added “Embarkment point” so we can see the results also differentiated by the point of embarkment .. figure:: images/BubbleChartAgeFareSizeEmb.png :class: with-shadow :scale: 80 *image BubbleChartAgeFareSizeEmb* Correlation Chart --------------------- The depiction address each numerical variable's correlation against all other variables. Bigger the connection better the correlation between the variables. .. figure:: images/Correlation.png :class: with-shadow :scale: 80 *image Correlation* .. figure:: images/Correlationtarget.png :class: with-shadow :scale: 80 *image Correlationtarget* Scatter Graph --------------------- Scatter graph uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables. .. figure:: images/Scatter.png :class: with-shadow :scale: 80 *image Scatter* For our demonstration we have used three different numerical variables - Age as the target on X axis and Fare and Total meal cost on the Y axis. We can even add more numerical variables and all will be displayed in one chart. Parallel Coordinates --------------------- Parallel coordinates are a common way of visualising and analysing high-dimensional datasets. It is applied to data where the axes do not correspond to points in time, and therefore do not have a natural order. .. figure:: images/Parallel.png :class: with-shadow :scale: 80 *image Parallel* We have used Meal Cost, Number of Meals and Total Meal cost to visualise the relation between these variables. Scatter Matrix --------------------- In multivariate statistics and probability theory, the scatter matrix is a statistic that is used to make estimates of the covariance matrix, for instance of the multivariate normal distribution. Time Series --------------------- A time series is a series of data points indexed (or listed or graphed) in time order. .. figure:: images/TimeSeries.png :class: with-shadow :scale: 80 *image TimeSeries* On the X axis we need to use a date variable, in our case “Ticket purchase” and on the Y axis we used “Survived”, for aggregation we have selected Average value (we have also option of Count, Sum, Average, Minimum, Maximum, Median, Standard Deviation and Variance). From our graph we can see, that everybody who purchased the ticket on April 1st survived. Distribution ----------------- A distribution plot displays a distribution and range of a set of numeric values plotted against a target variable(s). .. image:: images/Visualisationdistribution.png :class: with-shadow :scale: 70 *Distribution* Advanced Analysis ================== Outlier analysis ---------------------------- Outlier analysis is the process of identifying outliers, or abnormal observations, in a dataset. Also known as outlier detection, it’s an important step in data analysis, as it removes erroneous or inaccurate observations which might otherwise skew conclusions. .. figure:: images/Outlieranalysis.png :class: with-shadow :scale: 80 *image Outlier analysis* Deviation analysis --------------------------------- The measurement of the absolute difference between any one number in a set and the mean of the set. Deviation analysis can be used to determine how a specification will behave in the face of such deviations. .. figure:: images/Deviationanalysis.png :class: with-shadow :scale: 80 *image Deviation analysis* Decision Tree analysis ------------------------- Decision tree analysis is the process of drawing a decision tree, which is a graphic representation of various alternative solutions that are available to solve a given problem, in order to determine the most effective courses of action. Decision trees are comprised of nodes and branches. Nodes represent a test on an attribute and branches represent potential alternative outcomes. .. figure:: images/Decisiontreeanalysis.png :class: with-shadow :scale: 80 *image Decision Tree analysis* Predictive Analytics ===================== .. figure:: images/Predict.png :class: with-shadow :scale: 80 *image Predictive Analytics* After completing our data manipulation and graphical analysis, we can move to the last part, which is the Predictive Analytics. In the “Statistics” part, you need to select all the columns that are important for your research and ignore those that have small or no relation to the target. From the beginning our target for the research is the “Survived” field. Very important step in this research was to convert the field “Survived” from “Numerical” to “Categorical” type, as keeping it as “Numerical” will result in very low accuracy. .. _my-reference-label: * Balancing In following step we can choose the type of balancing - “None”, “Down sampling”, “SMOTE” or “Up sampling” - in our example we selected “None”. Sampling is used to resolve class imbalance, because the imbalance in the observed classes might negatively impact our model. Machine learning algorithms have trouble learning when one class dominates the other so the balancing of data helps us avoid these problems. .. figure:: images/Balancing.png :class: with-shadow :scale: 120 *image Balancing* * None - no sampling of data performed, data stays as it is the dataset. * Down Sampling - the system will randomly subset all the classes in the data set so that their class frequencies match the least common class * SMOTE - Synthetic Minority Over sampling Technique - this technique synthesises a new minority instances between already existing minority instances * Up Sampling - the system will randomly sample and replace the minority class to be the same size as the majority class Next step is to select which algorithms should be used, the ratio (default is 80:20 - which means 80% of the sample is used for training and 20% is used for cross validation) and cross validation number (in how many groups the cross validation sample should be split into) and then click on the “Start learning” button and the system will run these algorithms. Algorithms ------------------ .. figure:: images/Algorithm1.png :class: with-shadow :scale: 80 *image Algorithms* Linear regression ---------------------- Linear regression is the statistical model that analyzes the linear relationship between a scalar response and one or more explanatory variables. Logistic regression ----------------------- Logistic regression is a supervised learning algorithm used to predict a dependent categorical target variable. Decision Tree --------------- Decision Trees are a type of Supervised Machine Learning where the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes, and leaves. Random Forest ---------------------- The random forest is a classification algorithm consisting of many decisions trees. It is an ensemble method that is better than a single decision tree as it reduces over-fitting by averaging the result. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by the committee is more accurate than that of any individual tree. Gradient Boosting Machines -------------------------------- A Gradient Boosting Machine or GBM combines the predictions from multiple decision trees to generate the final predictions. eXtreme Gradient Boosting -------------------------- eXtreme Gradient Boosting or XGBoost is another popular boosting algorithm. In fact, XGBoost is simply an improvised version of the GBM algorithm! The working procedure of XGBoost is the same as GBM. The trees in XGBoost are built sequentially. XGBoost also includes a variety of regularization techniques that reduce overfitting and improve overall performance. Naive Bayes ------------------- The Naive Bayes is a classification algorithm that is suitable for binary and multiclass classification. Naïve Bayes performs well in cases of categorical input variables compared to numerical variables. It is useful for making predictions and forecasting data based on historical results. K-Nearest Neighbors ---------------------- KNN is a non-parametric method used for classification. It is one of the best-known classification algorithms in that known data are arranged in a space defined by the selected features. When new data is supplied to the algorithm, the algorithm will compare the classes of the k closest data to determine the class of the new data. Support Vector Machine -------------------------- Support vector machine algorithm helps to find a hyperplane in N-dimensional space(N — the number of features) that distinctly classifies the data points. Multilayer Perceptron -------------------------- A multilayer perceptron (MLP) is a feedforward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of input nodes connected as a directed graph between the input and output layers. Light GBM -------------- Light Gradient Boosted Machine, or LightGBM for short, is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm. It can be used for data having more than 10,000+ rows. No fixed threshold helps in deciding the usage of LightGBM. It can be used for large volumes of data especially when one needs to achieve high accuracy. We can see the status of each algorithm, some of these are more complicated and will take more time. After all are at 100% of completion, the algorithms are sorted by their accuracy. In our case, we can see 7 algorithms were used. * XGBOOST - decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework - with this algorithm we got an accuracy of 85%. * Decision tree * GBM - Gradient boosting * Random forest * Logistic regression * Naive Bayes * KNN - K nearest neighbors We can also see further details of the algorithm - after clicking on the “Metrices” button. Variable of Importance ------------------------- .. figure:: images/ModelDetailsVI.png :class: with-shadow :scale: 80 *image ModelDetailsVI* In this chart we can see which data field has a higher correlation to our selected target field. Metrics - Classification report --------------------------------- .. figure:: images/ClassificationReport.png :class: with-shadow :scale: 80 *image ClassificationReport* Classification report is used to measure the quality of predictions from a classification algorithm. The metrics are calculated by using true and false positives, true and false negatives. Metrics- Confusion Matrix ---------------------------- .. figure:: images/Confusion Matrix.png :class: with-shadow :scale: 80 *image Confusion Matrix* Confusion matrix, also known as an error matrix, is a specific table layout that allows visualisation of the performance of an algorithm. It shows the “True Positives”, “True Negatives”, “False positives” and “False Negatives”. Metrics- Gain Charts ---------------------- .. figure:: images/Gain.png :class: with-shadow :scale: 80 *image Gain Charts* Cumulative Gain charts are used for visually measuring the effectiveness of classification model and a very popular metrics in predictive analytics. Gain charts measures the performance for a proportion of the population data. It is calculated as the ratio between the results obtained without model (random line) or with the model (the model line). The greater the area between these two line the better the model is. The Y-axis shows the target data and the X-axis shows the percentage of the population. In the image in the Gain Charts, the black line is the random line, and the green and orange Line is results based on the model for the two classes. From the figure above it can be see that orange class model has performed better than the green model, and both the line has performed better than the random line (black). Metrics- Lift Chart ---------------------- .. figure:: images/Lift.png :class: with-shadow :scale: 80 *image Lift Charts* Lift charts are used for visually measuring the effectiveness of classification model and also a popular metrics in predictive analytics. Lift Charts shows the actual lift. The lift curve is determined by measuring the ratio between the result predicted by model and the result using no model. A lift chart graphically represents the improvement that a model has when compared against a random guess, and it measures the change in terms of a lift score. A model can be determined which is best, by comparing the lift scores. A lift chart shows the point at which the model's predictions are strong and when it become weak (less useful). Metrics- K-S Chart --------------------- .. figure:: images/K-S.png :class: with-shadow :scale: 80 *image K-S* K-S is a measure of the degree of separation between the positive and negative distributions. AUC- Area under the ROC Curve --------------------------------- .. figure:: images/AUC.png :class: with-shadow :scale: 80 *image AUC* On the Y axis we can see the True Positive Rate - samples that were evaluated correctly, on the X axis we can see the False Positive Rate - samples evaluated incorrectly. When we plot each of the results, we get the ROC curve. We evaluate the AUC - area under the ROC - if the value is near to zero, the selected model is not classifying correctly. If the value is near to 1, the selected model is classifying correctly. In our case the number is 0.92. Model comparison ------------------ The SEDGE application now allows data analysts to review and compare models by MAPE (mean absolute percentage error), RMSE (root mean square error), MSE (mean squared error), MAE (mean absolute error), and MLCH. This allows the analyst to save and test models for greatest accuracy. .. figure:: images/Modelcompare.png :class: with-shadow :scale: 80 *image Model Comparison* Trained model sharing to users who can access from any other environment. .. figure:: images/Modelsharing.png :class: with-shadow :scale: 80 *image Modelsharing* Actual vs predicted ------------------------ Actual vs. predicted plots are the visualization of actual vs. predicted values for all predictive models. .. figure:: images/actual.png :class: with-shadow :scale: 80 *image Actual vs Predicted* The above is the graph between the actual and predicted values. Shapely importance ------------------- SHAP or Shapley Additive exPlanations is a visualization technique that can be used for making a machine learning model more explainable by visualizing its output. It can be used for explaining the prediction of any model by computing the contribution of each feature to the prediction. If the SHAP value is much closer to zero, we can say that the data point contributes very little to predictions. If the SHAP value is a strong positive or strong negative value, we can say that the data point greatly contributes to predicting the positive or negative class. .. image:: images/Predictionshapelyimportance.png :class: with-shadow :scale: 70 *Shapely importance*